Hi,

Your statement is correct. Text extraction is typically done on the search engine side. The Solr output connector, for instance, sends documents to Solr's extracting update request handler, which uses Solr Cell (Tika) to extract their text content.
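To make the hand-off concrete, here is a rough sketch of what posting a raw file to the extracting update request handler can look like. This is not ManifoldCF's actual connector code. The base URL, the helper names, and the `author` metadata field are assumptions for illustration; `/update/extract`, the `literal.<field>` parameters, and `commit` are the handler's documented conventions.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, doc_id, metadata=None, commit=True):
    """Build the URL for Solr's extracting update request handler.

    Metadata gathered by the crawler (path, security, ...) is passed as
    literal.<field> parameters; Solr Cell (Tika) extracts the text itself.
    """
    params = [("literal.id", doc_id)]
    for name, value in (metadata or {}).items():
        params.append((f"literal.{name}", value))
    if commit:
        params.append(("commit", "true"))
    return f"{solr_base.rstrip('/')}/update/extract?{urlencode(params)}"


def post_document(solr_base, doc_id, path, metadata=None):
    """Send the raw bytes of a file (e.g. PDF or Excel) to Solr for extraction."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(solr_base, doc_id, metadata),
        data=body,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical local Solr instance and document id.
    print(build_extract_url("http://localhost:8983/solr", "doc1", {"author": "abe"}))
```

The point of the sketch is that the crawler only ships bytes plus metadata; no Tika dependency is needed on the crawling side.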
Karl

2011/3/2 阿部 慎一朗 <[email protected]>:
> Hello.
> I want to use ManifoldCF to send output to Solr.
> My crawl target is the files in a Windows shares repository.
> I think this framework can obtain the paths, security, and metadata of
> those files by running jobs.
> But it probably cannot extract the text content of the crawled files, so
> that content cannot become attributes of the Solr output. For example, the
> text of MS Excel or PDF documents.
> To implement text content extraction in ManifoldCF, it would need to
> include a framework like Tika.
> Is this idea correct? Any other ideas are welcome. Thanks.
