JCIFS connector's document fingerprinting feature is not general enough
-----------------------------------------------------------------------

                 Key: CONNECTORS-16
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-16
             Project: Lucene Connector Framework
          Issue Type: Improvement
          Components: Framework agents process, Framework crawler agent,
                      GTS connector, JCIFS connector, LiveLink connector,
                      Lucene/SOLR connector, Meridio connector, RSS connector,
                      SharePoint connector, Web connector
            Reporter: Karl Wright
            Priority: Minor


The JCIFS connector has a feature, called "fingerprinting", which allows it to
classify documents according to the ability of the back end to index their
content.  At the moment, the fingerprinter recognizes PDFs, Microsoft Office
files, and text files as indexable.  One could imagine, though, that different
SOLR plugins, etc. might be able to index more than that.  Also, other
connectors could potentially benefit from similar technology, specifically any
connector that deals with binary documents.

One approach to solving this problem would be to remove the feature entirely,
and allow whatever pipeline exists in SOLR to determine indexability after the
fact.  The reason this feature was added at MetaCarta, however, is that it may
be possible to exclude a useless document without having to fetch the whole
thing, and (at least for MetaCarta clients) the number of unindexable files of
gigantic size was a big concern.
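
To illustrate the point about not fetching the whole document, here is a
minimal sketch of prefix-based classification in Java.  It is not the JCIFS
connector's actual fingerprinter; the class name HeaderFingerprinter, the Kind
enum, and the 512-byte prefix size are all assumptions made for illustration.
Only the file signatures themselves (the "%PDF-" header, the OLE2 compound file
header used by legacy Office files, and the ZIP local file header) are
well-known facts.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the JCIFS connector's actual fingerprinter.
final class HeaderFingerprinter {
    enum Kind { PDF, OLE2_OFFICE, ZIP_CONTAINER, TEXT, UNKNOWN }

    // Assumed prefix size; large enough for the signatures and a text sample.
    static final int PREFIX_BYTES = 512;

    private static final byte[] PDF_MAGIC =
        "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // OLE2 compound file header, used by legacy .doc/.xls/.ppt files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };
    // ZIP local file header; OOXML files are ZIPs, but so are many others.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Reads at most PREFIX_BYTES from the stream, then classifies the prefix.
    static Kind classify(InputStream in) throws IOException {
        byte[] prefix = new byte[PREFIX_BYTES];
        int n = 0;
        while (n < prefix.length) {
            int r = in.read(prefix, n, prefix.length - n);
            if (r < 0) break; // end of stream
            n += r;
        }
        return classifyPrefix(prefix, n);
    }

    // Classifies the first `length` bytes of `prefix`.
    static Kind classifyPrefix(byte[] prefix, int length) {
        if (startsWith(prefix, length, PDF_MAGIC)) return Kind.PDF;
        if (startsWith(prefix, length, OLE2_MAGIC)) return Kind.OLE2_OFFICE;
        if (startsWith(prefix, length, ZIP_MAGIC)) return Kind.ZIP_CONTAINER;
        if (length > 0 && looksLikeText(prefix, length)) return Kind.TEXT;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, byte[] magic) {
        if (length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    // Crude heuristic: treat the prefix as text if it contains no control
    // bytes other than tab, LF, VT, FF, and CR.
    private static boolean looksLikeText(byte[] data, int length) {
        for (int i = 0; i < length; i++) {
            int v = data[i] & 0xFF;
            if (v < 0x09 || (v > 0x0D && v < 0x20)) return false;
        }
        return true;
    }
}
```

A connector using something like this would open the remote file, call
classify() on the stream, and close it without reading further when the result
is UNKNOWN.  Note that a ZIP signature alone does not prove the file is an
Office document, so a real implementation would need a further check there.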

Another approach might be to tie the functionality into the output connector
interface, so that an output connector would (somehow) determine whether a
document is indexable.  This would require some care to make it possible to
fingerprint without having to download the entire document, but would otherwise
have the correct overall structure.
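
One possible shape for this second approach is sketched below in Java.  None of
these names exist in the framework: IndexabilityCheck, PrefixSource,
PdfOnlyOutput, and FetchDecision are hypothetical, and the real output
connector interface may look quite different.  The idea is only that the output
side says how many leading bytes it needs, and the repository side supplies
just that many before deciding whether to fetch the rest.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical: what an output connector might expose for fingerprinting.
interface IndexabilityCheck {
    // Number of leading bytes the output needs in order to decide.
    int prefixBytesNeeded();
    // True if a document starting with these bytes could be indexed.
    boolean acceptsPrefix(byte[] prefix);
}

// Hypothetical: how a repository connector might supply a partial read.
interface PrefixSource {
    // Returns up to maxBytes from the start of the document.
    byte[] readPrefix(int maxBytes);
}

// Example output that (by assumption) can only index PDFs.
final class PdfOnlyOutput implements IndexabilityCheck {
    private static final byte[] MAGIC =
        "%PDF-".getBytes(StandardCharsets.US_ASCII);

    public int prefixBytesNeeded() {
        return MAGIC.length;
    }

    public boolean acceptsPrefix(byte[] prefix) {
        return prefix.length >= MAGIC.length
            && Arrays.equals(Arrays.copyOf(prefix, MAGIC.length), MAGIC);
    }
}

final class FetchDecision {
    // Reads only as many bytes as the output asks for, then lets it decide.
    static boolean shouldFetchWholeDocument(IndexabilityCheck output,
                                            PrefixSource source) {
        byte[] prefix = source.readPrefix(output.prefixBytesNeeded());
        return output.acceptsPrefix(prefix);
    }
}
```

With this split, a SOLR output with richer plugins could accept more formats
simply by implementing the check differently, and the repository connectors
would not need to know which formats those are.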



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
