Hi Ted,

Since version 1.16 you can query CSV files with the new version of the CSV reader, which is bound to CompliantTextBatchReader. The usage has not changed. However, this reader does not support your idea. I think we can enhance the reader with a small amount of code. All the new storage and format plugins are based on EVF, which is more powerful and stable.
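As a rough illustration of the behavior discussed in this thread (report a line as mal-formed, or skip it up to a sanity limit), here is a minimal Python sketch. It is not Drill code and says nothing about how CompliantTextBatchReader works internally. The names `read_lenient` and `max_bad_rows` are hypothetical, and the sketch assumes one record per line (no quoted fields spanning several lines).

```python
import csv


def read_lenient(lines, max_bad_rows=10):
    """Parse CSV lines strictly, skipping mal-formed ones up to a limit.

    Returns (rows, bad), where bad is a list of (line_number, message).
    Hypothetical helper for illustration only; assumes one record per line.
    """
    rows, bad = [], []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            # strict=True makes the csv module raise csv.Error on bad
            # quoting instead of guessing what the writer meant.
            row = next(csv.reader([line], strict=True))
        except csv.Error as exc:
            bad.append((line_number, f"mal-formed input: {exc}"))
            if len(bad) > max_bad_rows:
                # Sanity limit: give up once too many lines are rejected.
                raise ValueError(
                    f"more than {max_bad_rows} mal-formed lines"
                ) from exc
            continue
        rows.append(row)
    return rows, bad


# The file from this thread: the third line has unescaped embedded quotes.
sample = [
    'desc,name',
    '"foo","x"',
    '"manure called "foo"","y"',
]

rows, bad = read_lenient(sample)
print(rows)  # the rows that parsed cleanly
print(bad)   # line numbers and messages for the rejected lines
```

With the sample above, the header and the first data line are kept and line 3 is reported as mal-formed rather than failing later with an unrelated exception.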
> On May 20, 2021, at 10:40 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> Luoc,
>
> How do I use the CompliantTextBatchReader?
>
> How is the speed?
>
> Can you point me at the old CSV reader? I am not sure where it is.
>
> On Thu, May 20, 2021 at 1:09 AM luoc <l...@apache.org> wrote:
>
>> Hello Ted,
>> It's a nice idea. I did a quick review of the CSV reader but did not
>> find any settings for handling the errors. Also, we have refactored the
>> CSV format using the EVF; please see CompliantTextBatchReader.java
>> (it complies with the RFC 4180 standard for text/csv files).
>>
>>> On May 20, 2021, at 13:49, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>
>>> I have a csv file that causes an exception when read by Drill. The file is
>>> slightly mal-formed (but R can read it).
>>>
>>> Interestingly, if I don't parse the header line, I don't get the exception
>>> and the problematic embedded quotes are handled well. Likewise, deleting
>>> the first data line (which is well-formed) causes the exception to go away.
>>> Deleting the second data line also causes the exception to stop. Fixing the
>>> quoting of the included quotes also fixes the problem. Swapping the lines
>>> works like deleting the first line. Repeating the first line after the
>>> second line still gets the exception.
>>>
>>> The file is this:
>>> -------------------------
>>> desc,name
>>> "foo","x"
>>> "manure called "foo"","y"
>>> -------------
>>>
>>> The exception is shown below. My thought is that if the CSV file is
>>> considered mal-formed, we should get an error on the line that says
>>> something along the lines of "mal-formed input". Even better would be to
>>> allow such lines to be omitted (up to some sanity limit) or to parse it
>>> correctly (which happens without headers being parsed).
>>>
>>> Anybody have any thoughts?
>>>
>>> Here is the R behavior (it omits the embedded quotes):
>>>
>>> > f = read.csv("v.csv")
>>> > f
>>>                desc name
>>> 1               foo    x
>>> 2 manure called foo    y
>>>
>>> And here is the exception:
>>>
>>> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
>>> NegativeArraySizeException Please, refer to logs for more information.
>>> [Error Id: 7153f837-45eb-43d1-8e19-e3ca0197c61b ]
>>> (java.lang.NegativeArraySizeException) null
>>> org.apache.drill.exec.vector.VarCharVector$Accessor.get():487
>>> org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():514
>>> org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():475
>>> org.apache.drill.exec.server.rest.WebUserConnection.sendData():147
>>> org.apache.drill.exec.ops.AccountingUserConnection.sendData():42
>>> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():120
>>> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
>>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
>>> java.security.AccessController.doPrivileged():-2
>>> javax.security.auth.Subject.doAs():422
>>> org.apache.hadoop.security.UserGroupInformation.doAs():1669
>>> org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
>>> org.apache.drill.common.SelfCleaningRunnable.run():38
>>> java.util.concurrent.ThreadPoolExecutor.runWorker():1149
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run():624
>>> java.lang.Thread.run():748