[ 
https://issues.apache.org/jira/browse/CONNECTORS-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294767#comment-15294767
 ] 

Karl Wright commented on CONNECTORS-1317:
-----------------------------------------

Hi Mr. Keuz,

ManifoldCF's many threads do not recognize this particular exception.  It is 
part of MCF's design that when something bad happens and MCF cannot tell what 
it is, it restarts the thread in question rather than leaving the crawler in a 
bad state.  That is why you see this behavior.

For this reason, it is critical in the ManifoldCF world that individual 
connectors characterize the kinds of exceptions they throw.  But an exception 
that is unexpected (as this one is) cannot, by definition, be characterized 
properly by the connector.  If an exception *were* expected, one would have 
to ask why the actual problem was not fixed instead.
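
To make the distinction concrete, here is a minimal sketch of the idea.  It is 
not ManifoldCF code (MCF is written in Java), and every name in it 
(TransientError, PermanentError, connector_process, worker_step) is 
hypothetical.  It only assumes a connector that translates the failures it 
anticipates into categories the framework knows, while anything else reaches 
the worker uncharacterized.

```python
class TransientError(Exception):
    """Hypothetical category: the failure may go away, so retry later."""


class PermanentError(Exception):
    """Hypothetical category: this document cannot be processed, so skip it."""


def connector_process(doc, raw_process):
    """Run the connector's work, characterizing the failures it anticipates."""
    try:
        raw_process(doc)
    except (TimeoutError, ConnectionError) as exc:
        # Anticipated: the remote side is unavailable right now.
        raise TransientError(str(exc)) from exc
    except (FileNotFoundError, PermissionError) as exc:
        # Anticipated: the document is gone or unreadable.
        raise PermanentError(str(exc)) from exc
    # Any other exception is unexpected and propagates as-is.


def worker_step(doc, raw_process):
    """Decide what a worker does with one document (assumed policy)."""
    try:
        connector_process(doc, raw_process)
        return "done"
    except TransientError:
        return "retry later"
    except PermanentError:
        return "skip"
    except Exception:
        # Uncharacterized failure: the worker cannot know whether its state
        # is still good, so the safe choice is to restart the thread.
        return "restart thread"
```

In this sketch, a parser bug that raises, say, a ValueError falls through to 
the last branch, which is the situation described above: the framework has no 
information about the failure, so it can only reset the thread.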

> Hang crawling job on some ZIP documents
> ---------------------------------------
>
>                 Key: CONNECTORS-1317
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1317
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector
>    Affects Versions: ManifoldCF 2.3
>         Environment: Ubuntu 14.04 Linux 3.13.0-86-generic i686 i686
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
> DB: Postgres 9.5.1
>            Reporter: Mr.Keuz
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.5
>
>
> I use ManifoldCF as a file crawler, but I found that the crawling process 
> hangs on some zip files, although other files parse normally. 
> Steps: 
> 1. Run ManifoldCF with "example/start.sh", using Postgres as the DB
> 2. Create a ManifoldCF pipeline: File -> Tika -> Solr
> 3. Put the zip file (attached below) in the folder
> 4. Run the job
> Here is the zip file that should reproduce the bug: 
> "ManifoldCF_ISSUE_Dive.Into.Python.3.Mark.Pilgrim.2009.zip"
> https://yadi.sk/d/0uSdrR5GrsgmG 
> Note:
> As far as I could tell (using strace), the crawler process tries to open and 
> parse the same zip file again and again (apparently from different worker 
> threads), and the document does not seem to be removed from the queue.
> I am a newbie to ManifoldCF, so it is hard for me to find the problem in the 
> source code.
> I can send additional info if needed.



