[ 
https://issues.apache.org/jira/browse/OAK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18011848#comment-18011848
 ] 

Julian Reschke commented on OAK-10790:
--------------------------------------

Also, oak-run references a different version of commons-csv.

> FullTextBinaryTextExtractor fails to extract text from csv
> ----------------------------------------------------------
>
>                 Key: OAK-10790
>                 URL: https://issues.apache.org/jira/browse/OAK-10790
>             Project: Jackrabbit Oak
>          Issue Type: Task
>            Reporter: Nitin Gupta
>            Assignee: Mohit Kataria
>            Priority: Major
>
> FullTextBinaryTextExtractor incorrectly identifies a text file as CSV and 
> fails while parsing it.
> A text file whose content has a structure similar to a CSV file, such as:
> {code:java}
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> {code}
> even with a .txt extension, gets parsed by the CSV parser via the 
> AutoDetectParser in Tika.
> This leads to a runtime exception in an OSGi-based setup if the commons-csv 
> bundle is not provided.
> {code:java}
> Caused by: java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
>     at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:169)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
>     at org.apache.jackrabbit.oak.osgi.TikaExtractionOsgiIT.assertFileContains(TikaExtractionOsgiIT.java:215)
>     at org.apache.jackrabbit.oak.osgi.TikaExtractionOsgiIT.text2(TikaExtractionOsgiIT.java:204)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.ops4j.pax.exam.invoker.junit.internal.ContainerTestRunner.runLeafWithRetry(ContainerTestRunner.java:97)
>     at org.ops4j.pax.exam.invoker.junit.internal.ContainerTestRunner.runChildWithRetry(ContainerTestRunner.java:84)
>     at org.ops4j.pax.exam.invoker.junit.internal.ContainerTestRunner.runChild(ContainerTestRunner.java:75)
>     at org.ops4j.pax.exam.invoker.junit.internal.ContainerTestRunner.runChild(ContainerTestRunner.java:43)
>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>     at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>     at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
>     at org.ops4j.pax.exam.invoker.junit.internal.JUnitProbeInvoker.invokeViaJUnit(JUnitProbeInvoker.java:124)
>     ... 25 more
> {code}
> I will add a test to demonstrate this.
>  
> We should handle this gracefully in Oak, for example by falling back to a 
> parser chosen by file extension or MIME type when the AutoDetectParser fails.
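
The fallback idea from the description could look roughly like the sketch below. It uses only the JDK so that it runs on its own: the `Extractor` interface, `extractWithFallback`, `extensionOf`, and the extension-to-extractor map are hypothetical names for illustration, not Oak or Tika APIs. In Oak, the primary extractor would presumably wrap the AutoDetectParser call. The sketch assumes that returning an empty string is an acceptable way to signal "no text extracted".

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

public class FallbackExtraction {

    /** Hypothetical stand-in for a parser; not an Oak or Tika type. */
    interface Extractor {
        String extract(byte[] content) throws Exception;
    }

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /**
     * Tries the primary extractor first; if it fails, retries with an
     * extractor selected by file extension. Returns "" if nothing works.
     */
    static String extractWithFallback(Extractor primary,
                                      Map<String, Extractor> byExtension,
                                      String fileName,
                                      byte[] content) {
        try {
            return primary.extract(content);
        } catch (LinkageError | Exception primaryFailure) {
            // NoClassDefFoundError is a LinkageError, so catching only
            // Exception would let the failure from the stack trace escape.
            Extractor fallback = byExtension.get(extensionOf(fileName));
            if (fallback == null) {
                return "";
            }
            try {
                return fallback.extract(content);
            } catch (LinkageError | Exception fallbackFailure) {
                return "";
            }
        }
    }

    public static void main(String[] args) {
        // Simulates the CSV parser failing because commons-csv is missing.
        Extractor missingCsv = c -> {
            throw new NoClassDefFoundError("org/apache/commons/csv/CSVFormat");
        };
        Extractor plainText = c -> new String(c, StandardCharsets.UTF_8);

        byte[] content = "a,b\na,b\n".getBytes(StandardCharsets.UTF_8);
        String text = extractWithFallback(
                missingCsv, Map.of("txt", plainText), "sample.txt", content);
        System.out.print(text);
    }
}
```

The point of the design is the catch clause: a missing bundle surfaces as an `Error`, not an `Exception`, so a handler that only catches `Exception` (or a Tika-specific exception type) would not make the extraction graceful.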



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
