[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes
[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626977#comment-15626977 ]

Krystal commented on DRILL-3178:

commit id: 83513daf0903e0d94fcaad7b1ae4e8ad6272b494

Using data from comment #2, verified that data gets returned as expected.

select * from `drill-3178/drill3178.csv`;
+------------------+
|     columns      |
+------------------+
| ["1","line1"]    |
| ["2","line2\n"]  |
| ["3","line3"]    |
+------------------+
3 rows selected (0.158 seconds)

select columns[0], columns[1] from `drill-3178/drill3178.csv`;
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| 1       | line1   |
| 2       | line2   |
| 3       | line3   |
+---------+---------+

select columns[0],columns[1] from `drill-3178/drill3178.csv` where columns[0] > 1 order by columns[1] desc;
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| 3       | line3   |
| 2       | line2   |
+---------+---------+
2 rows selected (0.373 seconds)

> csv reader should allow newlines inside quotes
> ----------------------------------------------
>
>                 Key: DRILL-3178
>                 URL: https://issues.apache.org/jira/browse/DRILL-3178
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Text & CSV
>    Affects Versions: 1.0.0
>         Environment: Ubuntu Trusty 14.04.2 LTS
>            Reporter: Neal McBurnett
>            Assignee: F Méthot
>             Fix For: 1.9.0
>
>         Attachments: drill-3178.patch
>
>
> When reading a csv file which contains newlines within quoted strings, e.g. via
> select * from dfs.`/tmp/q.csv`;
> Drill 1.0 says:
> Error: SYSTEM ERROR: com.univocity.parsers.common.TextParsingException: Error processing input: Cannot use newline character within quoted string
> But many tools produce csv files with newlines in quoted strings. Drill should be able to handle them.
> Workaround: the csvquote program (https://github.com/dbro/csvquote) can encode embedded commas and newlines, and even decode them later if desired.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes
[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587056#comment-15587056 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/593
[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes
[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15577086#comment-15577086 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user parthchandra commented on the issue:

    https://github.com/apache/drill/pull/593

    +1
[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes
[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15557029#comment-15557029 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user fmethot commented on the issue:

    https://github.com/apache/drill/pull/593

    Thanks for the comments so far. See my newer changes: I suggest that we remove the flag and add an extra method instead. There is no longer a check for a boolean in nextChar; instead there is an extra method call (readNext()->readNextNoNewLineCheck()).
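The "extra method instead of a flag" idea can be pictured with a small, self-contained sketch. This is not the code from the pull request: the class `NewlineCheckSketch` and all of its methods are hypothetical stand-ins, and the sketch assumes `"` is the quote character and `\n` the record separator. It only shows the shape of the design, where the caller picks the read method based on parser state so the per-character path never tests a boolean field.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; names do not come from Drill's TextInput.
final class NewlineCheckSketch {
    private final byte[] data;
    private int pos = 0;
    private int lineCount = 0;

    NewlineCheckSketch(String text) {
        this.data = text.getBytes(StandardCharsets.UTF_8);
    }

    /** Raw read: returns the next byte as-is, with no newline bookkeeping. */
    byte nextCharNoNewLineCheck() {
        return data[pos++];
    }

    /** Record-level read: same byte, but a '\n' is counted as a record boundary. */
    byte nextChar() {
        byte b = nextCharNoNewLineCheck();
        if (b == '\n') {
            lineCount++;
        }
        return b;
    }

    boolean hasNext() {
        return pos < data.length;
    }

    int lineCount() {
        return lineCount;
    }

    /** Counts record boundaries, skipping newlines that sit inside double quotes. */
    static int countRecordBreaks(String text) {
        NewlineCheckSketch in = new NewlineCheckSketch(text);
        boolean inQuotes = false;
        while (in.hasNext()) {
            // The caller chooses the method; no flag is consulted inside the read.
            byte b = inQuotes ? in.nextCharNoNewLineCheck() : in.nextChar();
            if (b == '"') {
                inQuotes = !inQuotes;
            }
        }
        return in.lineCount();
    }
}
```

With the three-row sample file from this issue (second row containing a newline inside quotes), `countRecordBreaks` sees three record boundaries rather than four.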
[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes
[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553523#comment-15553523 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/593#discussion_r82303872

    --- Diff: exec/java-exec/src/test/resources/store/text/WithQuotedCrLf.tbl ---
    @@ -0,0 +1,6 @@
    +"a
    +1"|a|a
    +a|"a
    +2"|a
    +a|a|"a
    +3"
    --- End diff --

    Is there an issue with git converting Windows-style newlines (\r\n) into Unix-style (\n) when this file is checked in & out? Will that mess up the test? Should the test generate this file to handle this particular special case?
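One way to act on the "should the test generate this file" question is to write the fixture from code, so the line endings never pass through git's newline conversion. The following is a minimal sketch under assumptions: the class `QuotedCrLfFixture` and its methods are hypothetical, the rows mirror the diff above, and the trailing CRLF after the last row is a guess, since the diff does not show how the file ends.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of the Drill test suite.
final class QuotedCrLfFixture {
    /** The three rows from the diff, with explicit CRLF line endings (trailing CRLF assumed). */
    static final String CONTENT =
        "\"a\r\n1\"|a|a\r\n"
      + "a|\"a\r\n2\"|a\r\n"
      + "a|a|\"a\r\n3\"\r\n";

    /** Writes the fixture into the given directory and returns its path. */
    static Path write(Path dir) {
        try {
            Path file = dir.resolve("WithQuotedCrLf.tbl");
            Files.write(file, CONTENT.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the fixture to a fresh temp directory and reads it back unchanged. */
    static String writeAndReadBack() {
        try {
            Path dir = Files.createTempDirectory("drill-3178-sketch");
            Path file = write(dir);
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the bytes are produced at test time, the `\r\n` sequences are exactly what the reader sees regardless of how the repository is checked out.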
[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes
[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553520#comment-15553520 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/593#discussion_r82304834

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java ---
    @@ -231,33 +231,34 @@ private void parseQuotedValue(byte prev) throws IOException {
         final TextInput input = this.input;
         final byte quote = this.quote;

    -    ch = input.nextChar();
    +    try {
    +      input.setMonitorForNewLine(false);
    +      ch = input.nextChar();

    -    while (!(prev == quote && (ch == delimiter || ch == newLine || isWhite(ch)))) {
    -      if (ch != quote) {
    -        if (prev == quote) { // unescaped quote detected
    -          if (parseUnescapedQuotes) {
    -            output.append(quote);
    -            output.append(ch);
    -            parseQuotedValue(ch);
    -            break;
    -          } else {
    -            throw new TextParsingException(
    -                context,
    -                "Unescaped quote character '"
    -                    + quote
    -                    + "' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.");
    +      while (!(prev == quote && (ch == delimiter || ch == newLine || isWhite(ch)))) {
    +        if (ch != quote) {
    +          if (prev == quote) { // unescaped quote detected
    +            if (parseUnescapedQuotes) {
    +              output.append(quote);
    +              output.append(ch);
    +              parseQuotedValue(ch);
    +              break;
    +            } else {
    +              throw new TextParsingException(context, "Unescaped quote character '" + quote + "' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.");
    +            }
               }
    +          output.append(ch);
    +          prev = ch;
    +        } else if (prev == quoteEscape) {
    +          output.append(quote);
    +          prev = NULL_BYTE;
    +        } else {
    +          prev = ch;
             }
    -        output.append(ch);
    -        prev = ch;
    -      } else if (prev == quoteEscape) {
    -        output.append(quote);
    -        prev = NULL_BYTE;
    -      } else {
    -        prev = ch;
    +        ch = input.nextChar();
           }
    -      ch = input.nextChar();
    +    } finally {
    --- End diff --

    I see why it is done in finally. However, as noted above, I'm not sure that pushing this kind of flag into the getChar function is the optimal approach...
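The reason for the `finally` can be shown in isolation. The diff above is cut off before the body of the `finally` block, so the sketch below assumes it restores the flag to `true`; that is a guess, and `MonitorFlagSketch` with its methods is a hypothetical stand-in rather than Drill's `TextInput`. The point is only that the flag is restored even when parsing the quoted value throws.

```java
// Hypothetical illustration; the restore-to-true in finally is an assumption.
final class MonitorFlagSketch {
    private boolean monitorForNewLine = true;

    void setMonitorForNewLine(boolean value) {
        this.monitorForNewLine = value;
    }

    boolean isMonitoring() {
        return monitorForNewLine;
    }

    /** Runs the quoted-value parsing step with newline monitoring switched off. */
    void parseQuoted(Runnable body) {
        try {
            setMonitorForNewLine(false);
            body.run();
        } finally {
            // Runs on normal return and on exception, so the flag cannot leak.
            setMonitorForNewLine(true);
        }
    }
}
```

The cost of this pattern is that every caller of the input's read method now depends on mutable state set elsewhere, which is the concern raised in the review comment.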
[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes
[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553522#comment-15553522 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/593#discussion_r82303401

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java ---
    @@ -231,33 +231,34 @@ private void parseQuotedValue(byte prev) throws IOException {
         final TextInput input = this.input;
         final byte quote = this.quote;

    -    ch = input.nextChar();
    +    try {
    +      input.setMonitorForNewLine(false);
    --- End diff --

    Seems an overly complex way to do the parsing. Is there any reason we want to capture the original newline character rather than the normalized one?

    If we need to capture the original one, then a cleaner way to do that is to keep track of the start & end position of the current token (character), and provide a method to return that block as a string.

    Then, scan for a close quote, reading characters & special-casing any newlines. If we want to include newlines in quoted strings sometimes, but not other times, then the check logic can be a bit more complex.

    But, the proposed solution of making newlines not be newlines seems a bit odd...
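The alternative described in this comment, scanning for the close quote and special-casing newlines along the way, might look roughly like the sketch below. It is an illustration under assumptions and not code from Drill: `QuotedFieldScanner` and `Result` are hypothetical names, the input is a `String` rather than a byte buffer, and a doubled quote (`""`) is assumed to be the escape for a literal quote. The `normalizeNewlines` parameter stands in for the "original versus normalized newline" choice raised in the comment.

```java
// Hypothetical sketch of "scan to the close quote, special-casing newlines".
final class QuotedFieldScanner {
    /** Decoded field value plus the index just past the closing quote. */
    static final class Result {
        final String value;
        final int end;

        Result(String value, int end) {
            this.value = value;
            this.end = end;
        }
    }

    static Result scan(String text, int start, boolean normalizeNewlines) {
        if (text.charAt(start) != '"') {
            throw new IllegalArgumentException("field does not start with a quote");
        }
        StringBuilder out = new StringBuilder();
        int i = start + 1;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '"') {
                // A doubled quote is an escaped literal quote (assumed convention).
                if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                    out.append('"');
                    i += 2;
                    continue;
                }
                return new Result(out.toString(), i + 1);
            }
            if (c == '\r' && normalizeNewlines) {
                // Collapse CRLF (or a lone CR) to a single LF inside the value.
                out.append('\n');
                i += (i + 1 < text.length() && text.charAt(i + 1) == '\n') ? 2 : 1;
                continue;
            }
            out.append(c); // newlines are ordinary data while inside quotes
            i++;
        }
        throw new IllegalArgumentException("unterminated quoted field");
    }
}
```

Here the newline handling lives in the quoted-value scanner itself, so the underlying character source does not need a mode switch.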
[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes
[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553521#comment-15553521 ]

ASF GitHub Bot commented on DRILL-3178:
---

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/593#discussion_r82296690

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextInput.java ---
    @@ -88,6 +88,11 @@
       private boolean endFound = false;

       /**
    +   * Switch for enabling/disabling new line detection
    --- End diff --

    Explain a bit more? Presumably, we already "monitor" and "detect" new lines in some way. What, specifically, does this add? Presumably, it sets the mode to enable new line detection within quotes (the title of the Jira entry)?
[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes
[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15497971#comment-15497971 ]

F Méthot commented on DRILL-3178:
-

The attached file drill-3178.patch is a simple fix for handling CRLF within a quoted value.
Most of the tests fail on my dev environment, and I don't have the credentials to push my local branch to the server to have it built and tested there.
[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes
[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15458600#comment-15458600 ]

F Méthot commented on DRILL-3178:
-

With 1.7 build, for this file:

> cat data/3428.csv
1,"line1"
2,"line2
"
3,"line3"

I get:

> select * from my_dfs.`/root/data/3428.csv`;
Error: DATA_READ ERROR: Error processing input: Cannot use newline character within quoted string, line=3, char=22. Content parsed: [ ]

Failure while reading file file:///root/data/3428.csv. Happened at or shortly before byte position 22.
Fragment 0:0

[Error Id: 49a05427-e763-4cca-9f97-e4b4308ecb75 on perfnode206.perf.lab:31010] (state=,code=0)
[jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes
[ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216384#comment-15216384 ]

Daniel Reznick commented on DRILL-3178:
---

As Drill is meant for working with data in place, having to pre-process files prior to use with Drill is counter-productive. Drill should work hard to read data as is when possible, and, as noted, many other tools both read and write delimited content with newlines in quoted fields.