[jira] [Comment Edited] (CONNECTORS-1494) Error crawling file system with file names having special characters.

2018-02-13 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358161#comment-16358161
 ] 

Vinay edited comment on CONNECTORS-1494 at 2/13/18 12:16 PM:
-

Thanks, Karl. I finally figured out the solution: I had to change the default 
locale configuration on Linux. I edited /etc/sysconfig/i18n and set 
LANG="en_US.ISO8859". ManifoldCF now picks up those files.
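
Since the fix above hinges on the locale that the JVM inherits from the operating system, it can help to see which charset the JVM is actually using for file names. The following is a minimal sketch, not part of ManifoldCF. It assumes a HotSpot/OpenJDK runtime, where the internal and undocumented system property sun.jnu.encoding reflects the file name charset; the class name, helper name, and sample file name are hypothetical. Which LANG value is correct depends on how the file names are really encoded on disk.

```java
import java.nio.charset.Charset;

public class FileNameEncodingCheck {

    /** Returns true if every character of name can be encoded in the given charset. */
    public static boolean canEncode(String name, Charset charset) {
        return charset.newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        // Assumption: "sun.jnu.encoding" is an internal HotSpot/OpenJDK property.
        // If it is absent, fall back to the JVM default charset.
        String jnuName = System.getProperty("sun.jnu.encoding", Charset.defaultCharset().name());
        Charset jnu = Charset.forName(jnuName);

        System.out.println("LANG             = " + System.getenv("LANG"));
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding = " + jnuName);

        // Hypothetical sample name; pass real file names as arguments to test them.
        String[] names = args.length > 0 ? args : new String[] {"r\u00e9sum\u00e9_\u0634.pdf"};
        for (String name : names) {
            System.out.println(name + " -> encodable: " + canEncode(name, jnu));
        }
    }
}
```

Running this under the same user and environment as the ManifoldCF agents process shows whether the locale change reached the JVM; the process has to be restarted after editing the locale file for the new value to take effect.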


was (Author: vinaybs...@gmail.com):
Thanks, Karl. I finally figured out the solution: I had to change the default 
locale configuration on Linux. I edited /etc/sysconfig/i18n and set 
LANG="en_US.UTF-8". ManifoldCF now picks up those files.

> Error crawling file system with file names having special characters.
> -
>
> Key: CONNECTORS-1494
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1494
> Project: ManifoldCF
> Issue Type: Bug
> Components: File system connector
> Affects Versions: ManifoldCF 2.9.1
> Reporter: Vinay
> Assignee: Karl Wright
> Priority: Critical
> Fix For: ManifoldCF 2.10
>
>
> I am crawling a file system mounted on a Linux machine, so the Repository 
> Connection is of type "File System". ManifoldCF does not pick up some files 
> whose names contain special characters.
> File ex: a_XY-SMnA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf
> exception: java.lang.NumberFormatException: For input string: ""
>      at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_151]
>      at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_151]
>      at java.lang.Long.<init>(Long.java:965) ~[?:1.8.0_151]
>      at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter$SpecPacker.<init>(DocumentFilter.java:513) ~[?:?]
>      at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter.getPipelineDescription(DocumentFilter.java:76) ~[?:?]
>      at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getTransformationDescription(IncrementalIngester.java:503) ~[mcf-agents.jar:?]
>      at org.apache.manifoldcf.crawler.system.PipelineSpecification.<init>(PipelineSpecification.java:47) ~[mcf-pull-agent.jar:?]
>      at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:308) [mcf-pull-agent.jar:?]
>  FATAL 2018-02-07T23:47:15,927 (Worker thread '2') - Error tossed: For input string: ""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1494) Error crawling file system with file names having special characters.

2018-02-09 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358161#comment-16358161
 ] 

Vinay commented on CONNECTORS-1494:
---

Thanks, Karl. I finally figured out the solution: I had to change the default 
locale configuration on Linux. I edited /etc/sysconfig/i18n and set 
LANG="en_US.UTF-8". ManifoldCF now picks up those files.






[jira] [Updated] (CONNECTORS-1494) Error crawling file system with file names having special characters.

2018-02-08 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated CONNECTORS-1494:
--
Priority: Critical  (was: Major)

> Error crawling file system with file names having special characters.
> -
>
> Key: CONNECTORS-1494
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1494
> Project: ManifoldCF
> Issue Type: Bug
> Components: File system connector
> Affects Versions: ManifoldCF 2.9.1
> Reporter: Vinay
> Assignee: Karl Wright
> Priority: Critical





[jira] [Commented] (CONNECTORS-1494) Error crawling file system with file names having special characters.

2018-02-08 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356891#comment-16356891
 ] 

Vinay commented on CONNECTORS-1494:
---

Thanks, Karl. Although the above solution partially fixes the issue, ManifoldCF 
still does not pick up files with names like 
"a_XY-SMnA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf" when run on a Linux machine. There 
are no errors in the logs.

If the same file is copied to a Windows machine and crawled by ManifoldCF on 
Windows, the file is picked up. Any idea why such files are not picked up 
when running on Linux? With no error on the console, we are unable to figure 
out the cause.
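
When files are skipped silently like this, one way to narrow the problem down is to list the directory from a small standalone Java program and look for names that did not survive decoding. The sketch below is an assumption about a useful diagnostic, not something ManifoldCF ships: a commonly reported symptom of a locale mismatch on Linux is that listed names contain the Unicode replacement character U+FFFD, or that the listed name no longer resolves to an existing path. The class and method names are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class UnreadableNameScan {

    /** A name is suspect if decoding left U+FFFD in it or the rebuilt path does not exist. */
    public static boolean isSuspect(String name, boolean exists) {
        return name.indexOf('\uFFFD') >= 0 || !exists;
    }

    /** Lists dir and returns the entries that look damaged by file name decoding. */
    public static List<String> suspectNames(File dir) {
        List<String> suspects = new ArrayList<>();
        String[] names = dir.list(); // null if dir is not a readable directory
        if (names == null) {
            return suspects;
        }
        for (String name : names) {
            // Note: a dangling symbolic link also reports exists() == false.
            if (isSuspect(name, new File(dir, name).exists())) {
                suspects.add(name);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (String name : suspectNames(dir)) {
            System.out.println("suspect: " + name);
        }
    }
}
```

If the program flags the affected files on Linux but not on Windows, the difference most likely lies in how the JVM decodes file names under the current locale rather than in the crawler configuration.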

> Error crawling file system with file names having special characters.
> -
>
> Key: CONNECTORS-1494
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1494
> Project: ManifoldCF
> Issue Type: Bug
> Components: File system connector
> Affects Versions: ManifoldCF 2.9.1
> Reporter: Vinay
> Assignee: Karl Wright
> Priority: Major





[jira] [Updated] (CONNECTORS-1494) Error crawling file system with file names having special characters.

2018-02-08 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated CONNECTORS-1494:
--
Description: 
I am crawling a file system mounted on a Linux machine, so the Repository 
Connection is of type "File System". ManifoldCF does not pick up some files 
whose names contain special characters.

File ex: a_XY-SMnA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf

exception: java.lang.NumberFormatException: For input string: ""
     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_151]
     at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_151]
     at java.lang.Long.<init>(Long.java:965) ~[?:1.8.0_151]
     at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter$SpecPacker.<init>(DocumentFilter.java:513) ~[?:?]
     at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter.getPipelineDescription(DocumentFilter.java:76) ~[?:?]
     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getTransformationDescription(IncrementalIngester.java:503) ~[mcf-agents.jar:?]
     at org.apache.manifoldcf.crawler.system.PipelineSpecification.<init>(PipelineSpecification.java:47) ~[mcf-pull-agent.jar:?]
     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:308) [mcf-pull-agent.jar:?]
 FATAL 2018-02-07T23:47:15,927 (Worker thread '2') - Error tossed: For input string: ""

  was:
I am crawling a file system mounted on a Linux machine, so the Repository 
Connection is of type "File System". ManifoldCF does not pick up some files 
whose names contain special characters.

File ex: 2GHz_XY-SCDMA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf

exception: java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_151]
    at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_151]
    at java.lang.Long.<init>(Long.java:965) ~[?:1.8.0_151]
    at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter$SpecPacker.<init>(DocumentFilter.java:513) ~[?:?]
    at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter.getPipelineDescription(DocumentFilter.java:76) ~[?:?]
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getTransformationDescription(IncrementalIngester.java:503) ~[mcf-agents.jar:?]
    at org.apache.manifoldcf.crawler.system.PipelineSpecification.<init>(PipelineSpecification.java:47) ~[mcf-pull-agent.jar:?]
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:308) [mcf-pull-agent.jar:?]
FATAL 2018-02-07T23:47:15,927 (Worker thread '2') - Error tossed: For input string: ""


> Error crawling file system with file names having special characters.
> -
>
> Key: CONNECTORS-1494
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1494
> Project: ManifoldCF
> Issue Type: Bug
> Components: File system connector
> Affects Versions: ManifoldCF 2.9.1
> Reporter: Vinay
> Priority: Major





[jira] [Created] (CONNECTORS-1494) Error crawling file system with file names having special characters.

2018-02-08 Thread Vinay (JIRA)
Vinay created CONNECTORS-1494:
-

Summary: Error crawling file system with file names having special characters.
Key: CONNECTORS-1494
URL: https://issues.apache.org/jira/browse/CONNECTORS-1494
Project: ManifoldCF
Issue Type: Bug
Components: File system connector
Affects Versions: ManifoldCF 2.9.1
Reporter: Vinay


I am crawling a file system mounted on a Linux machine, so the Repository 
Connection is of type "File System". ManifoldCF does not pick up some files 
whose names contain special characters.

File ex: 2GHz_XY-SCDMA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf

exception: java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_151]
    at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_151]
    at java.lang.Long.<init>(Long.java:965) ~[?:1.8.0_151]
    at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter$SpecPacker.<init>(DocumentFilter.java:513) ~[?:?]
    at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter.getPipelineDescription(DocumentFilter.java:76) ~[?:?]
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getTransformationDescription(IncrementalIngester.java:503) ~[mcf-agents.jar:?]
    at org.apache.manifoldcf.crawler.system.PipelineSpecification.<init>(PipelineSpecification.java:47) ~[mcf-pull-agent.jar:?]
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:308) [mcf-pull-agent.jar:?]
FATAL 2018-02-07T23:47:15,927 (Worker thread '2') - Error tossed: For input string: ""
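
Reading the frames above, the exception is raised when the Long constructor, called from DocumentFilter$SpecPacker, receives an empty string. That suggests a blank numeric value in the Document Filter transformation's specification rather than a failure caused directly by the file name, although this is only an inference from the trace. The sketch below reproduces the same exception in isolation and shows one hypothetical way to guard against a blank value; the class and helper names are illustrative and are not ManifoldCF code.

```java
public class EmptyLongParse {

    /** Returns the NumberFormatException message for s, or null if s parses as a long. */
    public static String failureMessage(String s) {
        try {
            Long.parseLong(s);
            return null;
        } catch (NumberFormatException e) {
            return e.getMessage();
        }
    }

    /** Hypothetical defensive helper: treats a null or blank value as "not set". */
    public static Long parseOrNull(String s) {
        if (s == null || s.trim().isEmpty()) {
            return null;
        }
        return Long.valueOf(s.trim());
    }

    public static void main(String[] args) {
        // Same failure as in the trace: parsing an empty string as a long.
        System.out.println("parse(\"\") failed with: " + failureMessage(""));
        // The guarded variant returns null instead of throwing.
        System.out.println("parseOrNull(\"\")     = " + parseOrNull(""));
        System.out.println("parseOrNull(\"1024\") = " + parseOrNull("1024"));
    }
}
```

If this reading is right, checking the Document Filter settings of the job for an empty numeric field would be a reasonable first step before looking at the file names themselves.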





[jira] [Commented] (CONNECTORS-1487) Cannot set tika settings in manifold

2018-01-15 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326813#comment-16326813
 ] 

Vinay commented on CONNECTORS-1487:
---

Thanks, Karl Wright. Yes, I am using Solr as the output connector and have 
unchecked the extract update handler option so that Solr's Tika does not parse 
the documents again. From your answer, I understand that I need to configure 
the maximum document length in the Solr settings.
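
The wording of the error suggests why the limit is not a Tika setting: the extractor stage appears to ask the stages after it whether they will accept a document of a given length, and it reports the rejection itself. The sketch below illustrates that pattern only. Every name in it (LengthGate, maxLength, extract) and the 100000-byte limit are hypothetical and do not reflect ManifoldCF's actual interfaces or defaults.

```java
public class DownstreamLengthCheck {

    /** Hypothetical stand-in for a downstream stage that declares a maximum length. */
    public interface LengthGate {
        boolean acceptsLength(long lengthInBytes);
    }

    /** Builds a gate that accepts documents up to maxBytes, inclusive. */
    public static LengthGate maxLength(final long maxBytes) {
        return length -> length <= maxBytes;
    }

    /** Hypothetical extractor step: ask the downstream stage before doing any work. */
    public static String extract(long length, LengthGate downstream) {
        if (!downstream.acceptsLength(length)) {
            return "Downstream pipeline rejected document with length " + length;
        }
        return "extracted";
    }

    public static void main(String[] args) {
        LengthGate output = maxLength(100000L); // hypothetical output-side limit
        System.out.println(extract(2048L, output));
        System.out.println(extract(348110L, output));
    }
}
```

Under this model, raising the limit on the output side is what lets the larger document through, which matches the conclusion in the comment above.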

> Cannot set tika settings in manifold
> 
>
> Key: CONNECTORS-1487
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1487
> Project: ManifoldCF
> Issue Type: Task
> Components: Documentation
> Affects Versions: ManifoldCF 2.9
> Environment: linux
> Reporter: Vinay
> Priority: Major
> Attachments: tika.png
>
>
> I have configured a Tika extractor in the ManifoldCF job. Tika works fine 
> for small files, but for bigger files I get the error "Downstream pipeline 
> rejected document with length 348110". The failing stage is 
> extract['tika']. Where can I change the Tika settings so that it accepts 
> large files?





[jira] [Created] (CONNECTORS-1487) Cannot set tika settings in manifold

2018-01-15 Thread Vinay (JIRA)
Vinay created CONNECTORS-1487:
-

Summary: Cannot set tika settings in manifold
Key: CONNECTORS-1487
URL: https://issues.apache.org/jira/browse/CONNECTORS-1487
Project: ManifoldCF
Issue Type: Task
Components: Documentation
Affects Versions: ManifoldCF 2.9
Environment: linux
Reporter: Vinay
Attachments: tika.png

I have configured a Tika extractor in the ManifoldCF job. Tika works fine 
for small files, but for bigger files I get the error "Downstream pipeline 
rejected document with length 348110". The failing stage is extract['tika']. 
Where can I change the Tika settings so that it accepts large files?


