[ 
https://issues.apache.org/jira/browse/CONNECTORS-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294748#comment-15294748
 ] 

Karl Wright commented on CONNECTORS-1317:
-----------------------------------------

I have verified that the attached file is successfully extracted by Tika on a 
trunk build:

{code}
Start Time               Activity                    Identifier                       Result Code  Bytes  Time  Result Description
05-21-2016 01:56:01.173  output notification (null)                                   OK           0      1
05-21-2016 01:55:51.176  job end                     1463810108510(test)                           0      1
05-21-2016 01:55:43.736  document ingest (null)      file:/C:/testdata/something.zip  OK           364    1
05-21-2016 01:55:43.504  extract [tika]              file:/C:/testdata/something.zip  OK           364    216
05-21-2016 01:55:43.233  read document               C:\testdata\something.zip        OK           19806  507
05-21-2016 01:55:41.211  read document               C:\testdata                      OK           0      1
05-21-2016 01:55:41.133  job start                   1463810108510(test)                           0      1
{code}

I therefore strongly suggest you check out trunk and build it.  Instructions 
are provided on the "how to build and deploy" page on the web site.




> Hang crawling job on some ZIP documents
> ---------------------------------------
>
>                 Key: CONNECTORS-1317
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1317
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector
>    Affects Versions: ManifoldCF 2.3
>         Environment: Ubuntu 14.04 Linux 3.13.0-86-generic i686 i686
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
> DB: Postgres 9.5.1
>            Reporter: Mr.Keuz
>            Assignee: Karl Wright
>
> I use ManifoldCF as a file crawler, but I found that the crawling process hangs on 
> some ZIP files, although other files are parsed normally. 
> Steps: 
> 1. Run ManifoldCF with "example/start.sh", using Postgres as the DB
> 2. Create a ManifoldCF pipeline: File -> Tika -> Solr
> 3. Put the ZIP file (attached below) in the folder
> 4. Run the job
> Here is a ZIP file that should reproduce the bug: 
> "ManifoldCF_ISSUE_Dive.Into.Python.3.Mark.Pilgrim.2009.zip"
> https://yadi.sk/d/0uSdrR5GrsgmG 
> Note:
> As far as I could tell (using strace), the crawler process tries to open and parse the same 
> ZIP file again and again (apparently from different worker threads), and the 
> document does not seem to be removed from the queue.
> I am a newbie to ManifoldCF, so it is hard for me to find the problem in the source 
> code.
> I can send some additional info if needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
