Re: Support for content extraction?

Karl Wright Wed, 02 Mar 2011 05:36:12 -0800

Hi,
Your statement is correct.
Text extraction is typically done on the search engine side - the Solr
output connector, for instance, sends documents into the extracting
update request handler, which uses Solr Cell (Tika) to extract their
text contents.
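
As a rough sketch of what such a request looks like (this is not the
connector's actual code), the Python below posts a raw file to Solr's
/update/extract handler. The base URL, the "path" field, and the helper
names build_extract_url and post_file are assumptions for illustration;
any literal.* field you send has to exist in your Solr schema.

```python
from urllib.parse import urlencode
import urllib.request


def build_extract_url(solr_base, doc_id, metadata=None, commit=True):
    """Build the URL for Solr's extracting update request handler.

    literal.<field> parameters attach metadata to the document whose
    text Solr Cell (Tika) extracts on the server side.
    """
    params = [("literal.id", doc_id)]
    # Sort so the generated URL is deterministic.
    for name, value in sorted((metadata or {}).items()):
        params.append(("literal." + name, value))
    if commit:
        params.append(("commit", "true"))
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def post_file(solr_base, doc_id, path, metadata=None):
    """Send the raw file bytes; Solr does the text extraction itself."""
    with open(path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        build_extract_url(solr_base, doc_id, metadata),
        data=body,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical Solr location and document; adjust to your setup.
    print(build_extract_url("http://localhost:8983/solr", "doc1",
                            {"path": "//server/share/report.pdf"}))
```

The point is that the client only ships bytes plus metadata; no Tika
dependency is needed on the crawler side.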


Karl

2011/3/2 阿部 慎一朗 <[email protected]>:
> Hello.
> I want to use ManifoldCF to output into Solr.
> My crawl target is files in a Windows shares repository.
> I think this framework can obtain the paths, security, and metadata of those
> files by executing jobs.
> But it probably cannot extract the text content of the crawled files, so that
> content cannot become attributes of the Solr output. For example, the text of
> MS Excel or PDF documents.
> To implement text content extraction in ManifoldCF, it would need to include
> a framework like Tika.
> Is this idea correct? Or any other ideas, please. Thanks.
>
>
