Support for content extraction?

阿部慎一朗 Wed, 02 Mar 2011 03:24:14 -0800

Hello.
I want to use ManifoldCF to output crawled documents to Solr.
My crawl target is the files in a Windows shares repository.
I think this framework can obtain the paths, security information, and
metadata of those files by executing jobs.
However, it probably cannot extract the text content of the crawled files, so
that content cannot be included as attributes of the Solr output. Examples
would be the text data of MS Excel or PDF documents.
If text content extraction were implemented in ManifoldCF, it would need to
include a framework like Tika.
Is this understanding correct? If you have any other ideas, please let me
know. Thanks.
