Also, where would I find the unit tests for the compliant text reader? I have a simple enough case to write a unit test, but I can't see any reference to the class in question outside of working code.
On Thu, May 20, 2021 at 7:40 AM Ted Dunning <ted.dunn...@gmail.com> wrote:

> Luoc,
>
> How do I use the CompliantTextBatchReader?
>
> How is the speed?
>
> Can you point me at the old CSV reader? I am not sure where it is.
>
> On Thu, May 20, 2021 at 1:09 AM luoc <l...@apache.org> wrote:
>
>> Hello Ted,
>> It's a nice idea. I did a quick review of the CSV reader, but did not
>> find any settings for handling such errors. Also, we have refactored the
>> CSV format using the EVF; please see CompliantTextBatchReader.java
>> (it complies with the RFC 4180 standard for text/csv files).
>>
>> > On May 20, 2021, at 13:49, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> >
>> > I have a csv file that causes an exception when read by Drill. The
>> > file is slightly mal-formed (but R can read it).
>> >
>> > Interestingly, if I don't parse the header line, I don't get the
>> > exception and the problematic embedded quotes are handled well.
>> > Likewise, deleting the first data line (which is well-formed) causes
>> > the exception to go away. Deleting the second data line also causes
>> > the exception to stop. Fixing the quoting of the included quotes also
>> > fixes the problem. Swapping the lines works like deleting the first
>> > line. Repeating the first line after the second line still gets the
>> > exception.
>> >
>> > The file is this:
>> > -------------------------
>> > desc,name
>> > "foo","x"
>> > "manure called "foo"","y"
>> > -------------
>> >
>> > The exception is shown below. My thought is that if the CSV file is
>> > considered mal-formed, we should get an error on the line that says
>> > something along the lines of "mal-formed input". Even better would be
>> > to allow such lines to be omitted (up to some sanity limit) or to
>> > parse it correctly (which happens without headers being parsed).
>> >
>> > Anybody have any thoughts?
>> >
>> > Here is the R behavior (it omits the embedded quotes):
>> >
>> > > f = read.csv("v.csv")
>> > > f
>> >                desc name
>> > 1               foo    x
>> > 2 manure called foo    y
>> >
>> > And here is the exception:
>> >
>> > org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
>> > NegativeArraySizeException
>> >
>> > Please, refer to logs for more information.
>> >
>> > [Error Id: 7153f837-45eb-43d1-8e19-e3ca0197c61b ]
>> >
>> >   (java.lang.NegativeArraySizeException) null
>> >     org.apache.drill.exec.vector.VarCharVector$Accessor.get():487
>> >     org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():514
>> >     org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():475
>> >     org.apache.drill.exec.server.rest.WebUserConnection.sendData():147
>> >     org.apache.drill.exec.ops.AccountingUserConnection.sendData():42
>> >     org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():120
>> >     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>> >     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
>> >     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
>> >     java.security.AccessController.doPrivileged():-2
>> >     javax.security.auth.Subject.doAs():422
>> >     org.apache.hadoop.security.UserGroupInformation.doAs():1669
>> >     org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
>> >     org.apache.drill.common.SelfCleaningRunnable.run():38
>> >     java.util.concurrent.ThreadPoolExecutor.runWorker():1149
>> >     java.util.concurrent.ThreadPoolExecutor$Worker.run():624
>> >     java.lang.Thread.run():748
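
To make the behavior proposed in the original message concrete (report "mal-formed input" with a line number, and skip bad lines up to a sanity limit), here is a minimal Python sketch. It is not Drill code and says nothing about how CompliantTextBatchReader works. The helper name `read_rows` and the `max_bad_lines` parameter are assumptions made up for illustration, and parsing one physical line at a time is a simplification that would not handle quoted fields containing newlines.

```python
import csv
import io

# The three-line file from the original message (blank lines removed).
SAMPLE = 'desc,name\n"foo","x"\n"manure called "foo"","y"\n'


def read_rows(text, max_bad_lines=10):
    """Return (rows, errors) for CSV text, skipping malformed lines.

    Hypothetical helper: the name and the max_bad_lines limit are
    illustrative assumptions, not part of any existing reader.
    """
    rows, errors = [], []
    for line_no, line in enumerate(io.StringIO(text), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            # strict=True makes the csv module raise csv.Error on a stray
            # quote inside a quoted field instead of guessing at intent.
            row = next(csv.reader([line], strict=True))
        except csv.Error as exc:
            errors.append(f"line {line_no}: mal-formed input ({exc})")
            if len(errors) > max_bad_lines:
                raise ValueError("too many mal-formed lines") from exc
            continue
        rows.append(row)
    return rows, errors


rows, errors = read_rows(SAMPLE)
for message in errors:
    print(message)
```

With the sample file, the header and the first data line are returned as rows, and the second data line is reported as an error rather than surfacing later as an unrelated exception.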