Hi Ted,

You can use this reader without switching anything if you are on the latest version (preferably 1.19.0). The unit tests for the compliant text reader are in the `drill-java-exec` module, in the `org.apache.drill.exec.store.easy.text.compliant` package.
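
As a side note on the file itself: RFC 4180 expects a double quote inside a quoted field to be written as two double quotes, which is the form that made the exception go away when you fixed the quoting. Here is a minimal sketch in plain Java of how the second data line would be written in that form. This is not Drill code; the class name `Rfc4180Quote` and its methods are hypothetical and only illustrate the quoting rule.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class Rfc4180Quote {

    // Wrap a field in double quotes and double every embedded double quote,
    // as RFC 4180 describes for quoted fields.
    public static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Quote each raw field value and join the results into one CSV record.
    public static String record(List<String> fields) {
        return fields.stream()
                .map(Rfc4180Quote::quote)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // The two data lines from the sample file, given as raw field values.
        System.out.println(record(Arrays.asList("foo", "x")));
        System.out.println(record(Arrays.asList("manure called \"foo\"", "y")));
    }
}
```

The second line prints as `"manure called ""foo""","y"`, which is how the problematic row looks with the embedded quotes escaped.
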
> On May 23, 2021, at 5:19 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> Also, where would I find the unit tests for the compliant text reader?
>
> I have a simple enough case to write a unit test, but I can't see any
> reference to the class in question outside of working code.
>
> On Thu, May 20, 2021 at 7:40 AM Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> Luoc,
>>
>> How do I use the CompliantTextBatchReader?
>>
>> How is the speed?
>>
>> Can you point me at the old CSV reader? I am not sure where it is.
>>
>> On Thu, May 20, 2021 at 1:09 AM luoc <l...@apache.org> wrote:
>>
>>> Hello Ted,
>>> It's a nice idea. I did a quick review of the CSV reader but did not
>>> find any settings for handling such errors. Also, we have refactored the
>>> CSV format using the EVF; please see CompliantTextBatchReader.java
>>> (it complies with the RFC 4180 standard for text/csv files).
>>>
>>>> On May 20, 2021, at 13:49, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>>
>>>> I have a csv file that causes an exception when read by Drill. The file is
>>>> slightly mal-formed (but R can read it).
>>>>
>>>> Interestingly, if I don't parse the header line, I don't get the exception
>>>> and the problematic embedded quotes are handled well. Likewise, deleting
>>>> the first data line (which is well-formed) causes the exception to go away.
>>>> Deleting the second data line also causes the exception to stop. Fixing the
>>>> quoting of the included quotes also fixes the problem. Swapping the lines
>>>> works like deleting the first line. Repeating the first line after the
>>>> second line still gets the exception.
>>>>
>>>> The file is this:
>>>> -------------------------
>>>> desc,name
>>>> "foo","x"
>>>> "manure called "foo"","y"
>>>> -------------------------
>>>>
>>>> The exception is shown below. My thought is that if the CSV file is
>>>> considered mal-formed, we should get an error on the line that says
>>>> something along the lines of "mal-formed input". Even better would be to
>>>> allow such lines to be omitted (up to some sanity limit) or to parse it
>>>> correctly (which happens without headers being parsed).
>>>>
>>>> Anybody have any thoughts?
>>>>
>>>> Here is the R behavior (it omits the embedded quotes):
>>>>
>>>> > f = read.csv("v.csv")
>>>> > f
>>>>                desc name
>>>> 1               foo    x
>>>> 2 manure called foo    y
>>>>
>>>> And here is the exception:
>>>>
>>>> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
>>>> NegativeArraySizeException Please, refer to logs for more information.
>>>> [Error Id: 7153f837-45eb-43d1-8e19-e3ca0197c61b ]
>>>>   (java.lang.NegativeArraySizeException) null
>>>>     org.apache.drill.exec.vector.VarCharVector$Accessor.get():487
>>>>     org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():514
>>>>     org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():475
>>>>     org.apache.drill.exec.server.rest.WebUserConnection.sendData():147
>>>>     org.apache.drill.exec.ops.AccountingUserConnection.sendData():42
>>>>     org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():120
>>>>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>>>>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
>>>>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
>>>>     java.security.AccessController.doPrivileged():-2
>>>>     javax.security.auth.Subject.doAs():422
>>>>     org.apache.hadoop.security.UserGroupInformation.doAs():1669
>>>>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
>>>>     org.apache.drill.common.SelfCleaningRunnable.run():38
>>>>     java.util.concurrent.ThreadPoolExecutor.runWorker():1149
>>>>     java.util.concurrent.ThreadPoolExecutor$Worker.run():624
>>>>     java.lang.Thread.run():748