[jira] [Comment Edited] (CONNECTORS-1494) Error crawling file system with file names having special characters.
[ https://issues.apache.org/jira/browse/CONNECTORS-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358161#comment-16358161 ]

Vinay edited comment on CONNECTORS-1494 at 2/13/18 12:16 PM:
-------------------------------------------------------------

Thanks Karl. I finally figured out the solution. I had to change the default locale configuration on Linux. I edited /etc/sysconfig/i18n and changed LANG="en_US.ISO8859". Now it is picking up those files.

was (Author: vinaybs...@gmail.com):
    Thanks Karl. I finally figured out the solution. I had to change the default locale configuration on Linux. I edited /etc/sysconfig/i18n and changed LANG="en_US.UTF-8". Now it is picking up those files.

> Error crawling file system with file names having special characters.
> ---------------------------------------------------------------------
>
>                 Key: CONNECTORS-1494
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1494
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector
>    Affects Versions: ManifoldCF 2.9.1
>            Reporter: Vinay
>            Assignee: Karl Wright
>            Priority: Critical
>             Fix For: ManifoldCF 2.10
>
>
> I am crawling a file system mounted on a Linux machine, so the Repository Connection is of type "File System". ManifoldCF is not picking up some files whose names contain special characters.
> Example file: a_XY-SMnA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf
> exception: java.lang.NumberFormatException: For input string: ""
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_151]
>         at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_151]
>         at java.lang.Long.<init>(Long.java:965) ~[?:1.8.0_151]
>         at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter$SpecPacker.<init>(DocumentFilter.java:513) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter.getPipelineDescription(DocumentFilter.java:76) ~[?:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getTransformationDescription(IncrementalIngester.java:503) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.crawler.system.PipelineSpecification.<init>(PipelineSpecification.java:47) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:308) [mcf-pull-agent.jar:?]
> FATAL 2018-02-07T23:47:15,927 (Worker thread '2') - Error tossed: For input string: ""

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
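The comment above attributes the fix to the LANG setting in /etc/sysconfig/i18n, and the thread records two different values for it (en_US.UTF-8 in the original comment, en_US.ISO8859 in the edited one). A quick way to see what a JVM actually derived from the process locale is to print its encoding-related settings. The following is a minimal sketch, not part of ManifoldCF. The class name `LocaleEncodingReport` is hypothetical, and `sun.jnu.encoding` is an implementation-specific property of OpenJDK-based JVMs that may be absent elsewhere.

```java
import java.nio.charset.Charset;

/** Hypothetical helper: reports the encoding settings the JVM picked up at startup. */
final class LocaleEncodingReport {

    /** Builds a short, line-per-setting report. Missing values show up as "null". */
    static String describe() {
        StringBuilder sb = new StringBuilder();
        // LANG and LC_ALL come from the environment of the process that started the JVM.
        sb.append("LANG=").append(System.getenv("LANG")).append('\n');
        sb.append("LC_ALL=").append(System.getenv("LC_ALL")).append('\n');
        // Default charset used for byte/char conversion when no charset is passed explicitly.
        sb.append("defaultCharset=").append(Charset.defaultCharset().name()).append('\n');
        // Implementation-specific (OpenJDK-based JVMs): encoding used for platform strings
        // such as file names. Treat it as a diagnostic hint only.
        sb.append("sun.jnu.encoding=").append(System.getProperty("sun.jnu.encoding")).append('\n');
        sb.append("file.encoding=").append(System.getProperty("file.encoding")).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Running it under the same user and environment as the ManifoldCF agents process, before and after changing LANG, should show whether the change reached the JVM. The service has to be restarted for a new locale to take effect.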
[jira] [Commented] (CONNECTORS-1494) Error crawling file system with file names having special characters.
[ https://issues.apache.org/jira/browse/CONNECTORS-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358161#comment-16358161 ]

Vinay commented on CONNECTORS-1494:
-----------------------------------

Thanks Karl. I finally figured out the solution. I had to change the default locale configuration on Linux. I edited /etc/sysconfig/i18n and changed LANG="en_US.UTF-8". Now it is picking up those files.

> Error crawling file system with file names having special characters.
> ---------------------------------------------------------------------
>
>                 Key: CONNECTORS-1494
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1494
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector
>    Affects Versions: ManifoldCF 2.9.1
>            Reporter: Vinay
>            Assignee: Karl Wright
>            Priority: Critical
>             Fix For: ManifoldCF 2.10
>
>
> I am crawling a file system mounted on a Linux machine, so the Repository Connection is of type "File System". ManifoldCF is not picking up some files whose names contain special characters.
> Example file: a_XY-SMnA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf
> exception: java.lang.NumberFormatException: For input string: ""
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_151]
>         at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_151]
>         at java.lang.Long.<init>(Long.java:965) ~[?:1.8.0_151]
>         at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter$SpecPacker.<init>(DocumentFilter.java:513) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter.getPipelineDescription(DocumentFilter.java:76) ~[?:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getTransformationDescription(IncrementalIngester.java:503) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.crawler.system.PipelineSpecification.<init>(PipelineSpecification.java:47) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:308) [mcf-pull-agent.jar:?]
> FATAL 2018-02-07T23:47:15,927 (Worker thread '2') - Error tossed: For input string: ""
[jira] [Updated] (CONNECTORS-1494) Error crawling file system with file names having special characters.
[ https://issues.apache.org/jira/browse/CONNECTORS-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinay updated CONNECTORS-1494:
------------------------------
    Priority: Critical  (was: Major)

> Error crawling file system with file names having special characters.
> ---------------------------------------------------------------------
>
>                 Key: CONNECTORS-1494
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1494
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector
>    Affects Versions: ManifoldCF 2.9.1
>            Reporter: Vinay
>            Assignee: Karl Wright
>            Priority: Critical
>
> I am crawling a file system mounted on a Linux machine, so the Repository Connection is of type "File System". ManifoldCF is not picking up some files whose names contain special characters.
> Example file: a_XY-SMnA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf
> exception: java.lang.NumberFormatException: For input string: ""
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_151]
>         at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_151]
>         at java.lang.Long.<init>(Long.java:965) ~[?:1.8.0_151]
>         at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter$SpecPacker.<init>(DocumentFilter.java:513) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter.getPipelineDescription(DocumentFilter.java:76) ~[?:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getTransformationDescription(IncrementalIngester.java:503) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.crawler.system.PipelineSpecification.<init>(PipelineSpecification.java:47) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:308) [mcf-pull-agent.jar:?]
> FATAL 2018-02-07T23:47:15,927 (Worker thread '2') - Error tossed: For input string: ""
[jira] [Commented] (CONNECTORS-1494) Error crawling file system with file names having special characters.
[ https://issues.apache.org/jira/browse/CONNECTORS-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356891#comment-16356891 ]

Vinay commented on CONNECTORS-1494:
-----------------------------------

Thanks Karl. Although the above solution partially fixes the issue, ManifoldCF still does not pick up files with names like "a_XY-SMnA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf" when it runs on a Linux machine, and there are no errors in the logs. If the same file is copied to a Windows machine and crawled by ManifoldCF there, the file is picked up. Any idea why such files are not being picked up on Linux? With no error on the console, we are unable to figure it out.

> Error crawling file system with file names having special characters.
> ---------------------------------------------------------------------
>
>                 Key: CONNECTORS-1494
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1494
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector
>    Affects Versions: ManifoldCF 2.9.1
>            Reporter: Vinay
>            Assignee: Karl Wright
>            Priority: Major
>
> I am crawling a file system mounted on a Linux machine, so the Repository Connection is of type "File System". ManifoldCF is not picking up some files whose names contain special characters.
> Example file: a_XY-SMnA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf
> exception: java.lang.NumberFormatException: For input string: ""
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_151]
>         at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_151]
>         at java.lang.Long.<init>(Long.java:965) ~[?:1.8.0_151]
>         at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter$SpecPacker.<init>(DocumentFilter.java:513) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter.getPipelineDescription(DocumentFilter.java:76) ~[?:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getTransformationDescription(IncrementalIngester.java:503) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.crawler.system.PipelineSpecification.<init>(PipelineSpecification.java:47) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:308) [mcf-pull-agent.jar:?]
> FATAL 2018-02-07T23:47:15,927 (Worker thread '2') - Error tossed: For input string: ""
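The comment above describes files being skipped on Linux with nothing in the logs. One commonly reported cause on Java 8 is that a file name's bytes cannot be decoded in the JVM's locale-derived encoding, so the name returned by a directory listing no longer maps back to the file on disk. The sketch below is a standalone diagnostic under that assumption. It is not ManifoldCF code, and the class name `FileNameRoundTripCheck` is hypothetical. It lists a directory and flags names that contain the Unicode replacement character or that do not resolve to an existing file.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic: finds directory entries whose names do not round-trip. */
final class FileNameRoundTripCheck {

    /** Returns the listed names in dir that look mis-decoded; empty if dir cannot be listed. */
    static List<String> findSuspectNames(File dir) {
        List<String> suspects = new ArrayList<>();
        String[] names = dir.list();
        if (names == null) {
            return suspects; // not a directory, or an I/O error occurred
        }
        for (String name : names) {
            // U+FFFD is what decoders typically substitute for undecodable bytes.
            boolean hasReplacementChar = name.indexOf('\uFFFD') >= 0;
            // If re-encoding the listed name yields different bytes, the lookup fails.
            // Note: exists() is also false for dangling symbolic links.
            boolean resolves = new File(dir, name).exists();
            if (hasReplacementChar || !resolves) {
                suspects.add(name);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (String name : findSuspectNames(dir)) {
            System.out.println("suspect: " + name);
        }
    }
}
```

Running this against the crawled directory with the same LANG as the crawler would show whether the affected PDF is among the suspect entries. An empty result points away from an encoding problem.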
[jira] [Updated] (CONNECTORS-1494) Error crawling file system with file names having special characters.
[ https://issues.apache.org/jira/browse/CONNECTORS-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinay updated CONNECTORS-1494:
------------------------------
    Description:
I am crawling a file system mounted on a Linux machine, so the Repository Connection is of type "File System". ManifoldCF is not picking up some files whose names contain special characters.
Example file: a_XY-SMnA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf
exception: java.lang.NumberFormatException: For input string: ""
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_151]
        at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_151]
        at java.lang.Long.<init>(Long.java:965) ~[?:1.8.0_151]
        at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter$SpecPacker.<init>(DocumentFilter.java:513) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter.getPipelineDescription(DocumentFilter.java:76) ~[?:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getTransformationDescription(IncrementalIngester.java:503) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.crawler.system.PipelineSpecification.<init>(PipelineSpecification.java:47) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:308) [mcf-pull-agent.jar:?]
FATAL 2018-02-07T23:47:15,927 (Worker thread '2') - Error tossed: For input string: ""

  was:
I am crawling a file system mounted on a Linux machine, so the Repository Connection is of type "File System". ManifoldCF is not picking up some files whose names contain special characters.
Example file: 2GHz_XY-SCDMA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf
exception: java.lang.NumberFormatException: For input string: ""
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_151]
        at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_151]
        at java.lang.Long.<init>(Long.java:965) ~[?:1.8.0_151]
        at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter$SpecPacker.<init>(DocumentFilter.java:513) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter.getPipelineDescription(DocumentFilter.java:76) ~[?:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getTransformationDescription(IncrementalIngester.java:503) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.crawler.system.PipelineSpecification.<init>(PipelineSpecification.java:47) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:308) [mcf-pull-agent.jar:?]
FATAL 2018-02-07T23:47:15,927 (Worker thread '2') - Error tossed: For input string: ""

> Error crawling file system with file names having special characters.
> ---------------------------------------------------------------------
>
>                 Key: CONNECTORS-1494
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1494
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector
>    Affects Versions: ManifoldCF 2.9.1
>            Reporter: Vinay
>            Priority: Major
>
> I am crawling a file system mounted on a Linux machine, so the Repository Connection is of type "File System". ManifoldCF is not picking up some files whose names contain special characters.
> Example file: a_XY-SMnA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf
> exception: java.lang.NumberFormatException: For input string: ""
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_151]
>         at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_151]
>         at java.lang.Long.<init>(Long.java:965) ~[?:1.8.0_151]
>         at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter$SpecPacker.<init>(DocumentFilter.java:513) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter.getPipelineDescription(DocumentFilter.java:76) ~[?:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getTransformationDescription(IncrementalIngester.java:503) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.crawler.system.PipelineSpecification.<init>(PipelineSpecification.java:47) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:308) [mcf-pull-agent.jar:?]
> FATAL 2018-02-07T23:47:15,927 (Worker thread '2') - Error tossed: For input string: ""
[jira] [Created] (CONNECTORS-1494) Error crawling file system with file names having special characters.
Vinay created CONNECTORS-1494:
---------------------------------

             Summary: Error crawling file system with file names having special characters.
                 Key: CONNECTORS-1494
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1494
             Project: ManifoldCF
          Issue Type: Bug
          Components: File system connector
    Affects Versions: ManifoldCF 2.9.1
            Reporter: Vinay


I am crawling a file system mounted on a Linux machine, so the Repository Connection is of type "File System". ManifoldCF is not picking up some files whose names contain special characters.
Example file: 2GHz_XY-SCDMA_ABC_Uuޓࠚϯmӣܼ˵Ҫȳ_֚3ҿؖúشԃԫхրҠë.pdf
exception: java.lang.NumberFormatException: For input string: ""
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_151]
        at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_151]
        at java.lang.Long.<init>(Long.java:965) ~[?:1.8.0_151]
        at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter$SpecPacker.<init>(DocumentFilter.java:513) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.documentfilter.DocumentFilter.getPipelineDescription(DocumentFilter.java:76) ~[?:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.getTransformationDescription(IncrementalIngester.java:503) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.crawler.system.PipelineSpecification.<init>(PipelineSpecification.java:47) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:308) [mcf-pull-agent.jar:?]
FATAL 2018-02-07T23:47:15,927 (Worker thread '2') - Error tossed: For input string: ""
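The stack trace in the report ends in the `java.lang.Long` string constructor, reached from `DocumentFilter$SpecPacker`, with the message `For input string: ""`. That message is what the JDK produces when an empty string is parsed as a long. The sketch below reproduces it in isolation and shows one defensive way to treat a blank value as "not set". It is an illustration only: the class `EmptyLongParse` and the method `parseOrNull` are hypothetical and are not ManifoldCF's code or its fix. Which field was empty in the reporter's job is not stated in the thread.

```java
/** Hypothetical illustration of the failure mode in the stack trace. */
final class EmptyLongParse {

    /** Parses a decimal long, returning null for null or blank input instead of throwing. */
    static Long parseOrNull(String raw) {
        if (raw == null) {
            return null;
        }
        String trimmed = raw.trim();
        if (trimmed.isEmpty()) {
            return null; // blank is treated as "no value configured"
        }
        // Still throws NumberFormatException for non-numeric text such as "abc".
        return Long.valueOf(trimmed);
    }

    public static void main(String[] args) {
        try {
            // Same parse path as the Long(String) constructor seen in the trace.
            Long.parseLong("");
        } catch (NumberFormatException e) {
            System.out.println("parse failed: " + e.getMessage());
        }
        System.out.println("blank -> " + parseOrNull(""));
        System.out.println("number -> " + parseOrNull("348110"));
    }
}
```

Because the exception is raised while the pipeline specification is built rather than while a file is read, it may be unrelated to the special characters in the file name. The thread does not settle this.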
[jira] [Commented] (CONNECTORS-1487) Cannot set tika settings in manifold
[ https://issues.apache.org/jira/browse/CONNECTORS-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326813#comment-16326813 ]

Vinay commented on CONNECTORS-1487:
-----------------------------------

Thanks, Karl Wright. Yes, I am using Solr as an output connector and have unchecked the extract update handler option so that Solr's Tika does not parse the documents again. From your answer, I understand that I need to configure the maximum document length in Solr's settings.

> Cannot set tika settings in manifold
> ------------------------------------
>
>                 Key: CONNECTORS-1487
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1487
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Documentation
>    Affects Versions: ManifoldCF 2.9
>         Environment: linux
>            Reporter: Vinay
>            Priority: Major
>         Attachments: tika.png
>
>
> I have configured a Tika extractor in the ManifoldCF job. Tika works fine for small files, but for bigger files I get the error "Downstream pipeline rejected document with length 348110". The stage is extract['tika']. Where can I change the Tika settings so that it accepts large files?
[jira] [Created] (CONNECTORS-1487) Cannot set tika settings in manifold
Vinay created CONNECTORS-1487:
---------------------------------

             Summary: Cannot set tika settings in manifold
                 Key: CONNECTORS-1487
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1487
             Project: ManifoldCF
          Issue Type: Task
          Components: Documentation
    Affects Versions: ManifoldCF 2.9
         Environment: linux
            Reporter: Vinay
         Attachments: tika.png

I have configured a Tika extractor in the ManifoldCF job. Tika works fine for small files, but for bigger files I get the error "Downstream pipeline rejected document with length 348110". The stage is extract['tika']. Where can I change the Tika settings so that it accepts large files?