[ https://issues.apache.org/jira/browse/OAK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mohit Kataria resolved OAK-10790.
---------------------------------
    Resolution: Fixed

> FullTextBinaryTextExtractor fails to extract text from csv
> ----------------------------------------------------------
>
>                 Key: OAK-10790
>                 URL: https://issues.apache.org/jira/browse/OAK-10790
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: indexing
>            Reporter: Nitin Gupta
>            Assignee: Mohit Kataria
>            Priority: Major
>
> FullTextBinaryTextExtractor incorrectly identifies a text file as CSV and
> fails while parsing it.
> A text file whose content has a structure similar to a CSV file, such as
> {code:java}
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> a,b
> {code}
> is parsed by the CSV parser via the AutoDetectParser in Tika, even when it
> has a .txt extension.
> This leads to a runtime exception in an OSGi-based setup if the commons-csv
> bundle is not provided.
> {code:java}
> Caused by: java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
>     at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:169)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
>     at org.apache.jackrabbit.oak.osgi.TikaExtractionOsgiIT.assertFileContains(TikaExtractionOsgiIT.java:215)
>     at org.apache.jackrabbit.oak.osgi.TikaExtractionOsgiIT.text2(TikaExtractionOsgiIT.java:204)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.ops4j.pax.exam.invoker.junit.internal.ContainerTestRunner.runLeafWithRetry(ContainerTestRunner.java:97)
>     at org.ops4j.pax.exam.invoker.junit.internal.ContainerTestRunner.runChildWithRetry(ContainerTestRunner.java:84)
>     at org.ops4j.pax.exam.invoker.junit.internal.ContainerTestRunner.runChild(ContainerTestRunner.java:75)
>     at org.ops4j.pax.exam.invoker.junit.internal.ContainerTestRunner.runChild(ContainerTestRunner.java:43)
>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>     at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>     at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
>     at org.ops4j.pax.exam.invoker.junit.internal.JUnitProbeInvoker.invokeViaJUnit(JUnitProbeInvoker.java:124)
>     ... 25 more
> {code}
> I will add a test to demonstrate this.
>
> We should handle this gracefully in Oak, and perhaps fall back to a parser
> chosen by file extension or MIME type when the AutoDetectParser fails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
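
The fallback suggested at the end of the issue description could look roughly like the sketch below. It is not the fix that was committed for OAK-10790 and it does not use Tika's API; `FallbackExtraction`, `TextExtractor`, `extractWithFallback`, and `plainText` are hypothetical names, and `TextExtractor` only stands in for a content-sniffing parser such as the AutoDetectParser. The sketch assumes that a `.txt` extension is enough to justify decoding the bytes as UTF-8 plain text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class FallbackExtraction {

    /** Hypothetical stand-in for a content-sniffing parser. */
    interface TextExtractor {
        String extract(byte[] content) throws Exception;
    }

    /** Fallback: decode the bytes as UTF-8 without any format detection. */
    static String plainText(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }

    /**
     * Tries the primary extractor first. If it fails, falls back to plain-text
     * decoding for files named *.txt and returns an empty string otherwise.
     */
    static String extractWithFallback(String fileName, byte[] content, TextExtractor primary) {
        try {
            return primary.extract(content);
        } catch (Exception | LinkageError e) {
            // NoClassDefFoundError is a LinkageError, not an Exception, so it
            // has to be caught explicitly to cover a missing optional bundle.
            if (fileName.toLowerCase(Locale.ROOT).endsWith(".txt")) {
                return plainText(content);
            }
            return "";
        }
    }

    public static void main(String[] args) {
        byte[] data = "a,b\na,b\n".getBytes(StandardCharsets.UTF_8);
        // Simulates the failure from the stack trace: the CSV classes are missing.
        TextExtractor broken = c -> {
            throw new NoClassDefFoundError("org/apache/commons/csv/CSVFormat");
        };
        System.out.print(extractWithFallback("sample.txt", data, broken));
    }
}
```

Catching `LinkageError` rather than `Throwable` keeps the fallback narrow: problems such as `OutOfMemoryError` still propagate, while a missing class in an OSGi setup no longer aborts extraction.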