[ 
https://issues.apache.org/jira/browse/CONNECTORS-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-16:
-------------------------------------

    Assignee: Karl Wright

> JCIFS connector's document fingerprinting feature is not general enough
> -----------------------------------------------------------------------
>
>                 Key: CONNECTORS-16
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-16
>             Project: Lucene Connector Framework
>          Issue Type: Improvement
>          Components: Framework agents process, Framework crawler agent, GTS 
> connector, JCIFS connector, LiveLink connector, Lucene/SOLR connector, 
> Meridio connector, RSS connector, SharePoint connector, Web connector
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Minor
>
> The JCIFS connector has a feature, called "fingerprinting", which allows it 
> to classify documents according to the ability of the back end to index that 
> content.  At the moment, this fingerprinter is capable of recognizing 
> PDFs, Microsoft Office files, and text files as being indexable.  One could 
> imagine, though, that different SOLR plugins, etc. might have more capability 
> than that.  Also, other connectors could potentially benefit from similar 
> technology, specifically any connector that deals with binary documents.
>
> One approach to solving this problem would be to remove the feature entirely, 
> and let whatever pipeline exists in SOLR determine the indexability after 
> the fact.  The reason this feature was added at MetaCarta, however, is that 
> it may be possible to exclude a useless document without having to fetch 
> the whole thing, and (at least for MetaCarta clients) the number of 
> gigantic unindexable files was a big concern.
>
> Another approach might be to tie the functionality in with the output 
> connector interface, so that an output connector would (somehow) determine 
> the applicability of a document.  This would require some care to make it 
> possible to fingerprint without having to download the entire document, but 
> would otherwise have the correct overall structure.
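
To make the "fingerprint without downloading the entire document" idea concrete, here is a minimal sketch in Java. It is not the JCIFS connector's actual fingerprinter: the class name FingerprintSketch, the Kind enum, and the classify method are all hypothetical, and the approach shown (matching well-known file signatures against the first few bytes) is only one plausible way to do it. The text check is a crude ASCII-only heuristic.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: classify a document from its first few bytes only. */
final class FingerprintSketch {
    enum Kind { PDF, OLE2_OFFICE, ZIP_BASED, TEXT, UNKNOWN }

    // PDF files begin with the ASCII header "%PDF-".
    private static final byte[] PDF_MAGIC =
        "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // OLE2 compound files (legacy .doc/.xls/.ppt) begin with this signature.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };
    // ZIP local file header; newer Office formats are ZIP containers,
    // but so are many other files, so this result is ambiguous.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** Classifies using only the supplied prefix of the document. */
    static Kind classify(byte[] prefix) {
        if (startsWith(prefix, PDF_MAGIC)) return Kind.PDF;
        if (startsWith(prefix, OLE2_MAGIC)) return Kind.OLE2_OFFICE;
        if (startsWith(prefix, ZIP_MAGIC)) return Kind.ZIP_BASED;
        if (prefix.length > 0 && looksLikeAsciiText(prefix)) return Kind.TEXT;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    // Assumption: "text" means printable ASCII plus tab, LF, and CR.
    private static boolean looksLikeAsciiText(byte[] data) {
        for (byte b : data) {
            int c = b & 0xFF;
            if (c == '\t' || c == '\n' || c == '\r') continue;
            if (c < 0x20 || c > 0x7E) return false;
        }
        return true;
    }
}
```

A caller would read only a small prefix of the remote file (a few hundred bytes, say), pass it to FingerprintSketch.classify, and skip the full fetch when the result is UNKNOWN. If the check moved behind the output connector interface, each output connector could supply its own notion of which kinds it accepts.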

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
