[ https://issues.apache.org/jira/browse/IMPALA-6374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669394#comment-16669394 ]
ASF subversion and git services commented on IMPALA-6374:
---------------------------------------------------------
Commit 85166afa8a884d6b1430a773f8877e2dfc2eb9f5 in impala's branch
refs/heads/master from [[email protected]]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=85166af ]
IMPALA-6374: fix handling of commas in .test files
The .test file parser implemented an unconventional method for parsing
single-quoted strings in comma-separated value format. This didn't handle
trailing commas in the string correctly.
This commit switches to using a conventional method for parsing
comma-separated value format:
* Commas enclosed by single quotes are not treated as field separators.
* Single quotes can be escaped within a string by doubling them.
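
The two rules above can be sketched roughly as follows. This is a
minimal illustration, not the code from the commit: the function name
split_quoted_row and its parameters are hypothetical, and it assumes
that quotes (including doubled ones) are kept verbatim in each field,
since the commit message says the quotes are semantically important.

```python
def split_quoted_row(row, sep=",", quote="'"):
    """Split one row into fields, keeping quote characters in each field.

    Hypothetical helper for illustration only; not the parser from the commit.
    """
    fields = []
    current = []
    in_quotes = False
    i = 0
    while i < len(row):
        ch = row[i]
        if in_quotes:
            if ch == quote and i + 1 < len(row) and row[i + 1] == quote:
                # Doubled quote inside a string: an escaped literal quote.
                # Keep both characters so the field text stays verbatim.
                current.append(quote + quote)
                i += 2
                continue
            if ch == quote:
                in_quotes = False  # closing quote
            current.append(ch)
        elif ch == quote:
            in_quotes = True  # opening quote
            current.append(ch)
        elif ch == sep:
            # Separator outside quotes ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted string in row: %r" % row)
    fields.append("".join(current))
    return fields


print(split_quoted_row("'A','Poor,','Books',2.79"))
# ["'A'", "'Poor,'", "'Books'", '2.79']
print(split_quoted_row("'it''s',1"))
# ["'it''s'", '1']
```

With this approach a string that ends in a comma, such as 'Poor,',
stays in one field because the comma is seen while still inside the
quotes.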
I looked into using Python's csv module for this, but it wouldn't
work without further changes to the test file format because it
automatically discards the quotes during parsing, and those quotes are
semantically important in .test files. E.g. without the quotes we can't
distinguish between the literal string 'regex:...' and the regex
regex:....
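
A small illustration of the quote-discarding behaviour described
above, using the standard csv module with quotechar set to a single
quote. The row contents here are made up for the example and are not
taken from any .test file.

```python
import csv

# One quoted literal string and one unquoted regex-style value.
row = "'regex:a,b',regex:c"

# csv.reader accepts any iterable of lines; quotechar selects the quote.
fields = next(csv.reader([row], quotechar="'"))
print(fields)  # ['regex:a,b', 'regex:c']

# After parsing, nothing records which field was originally quoted.
```

Both fields come back as plain strings, so a caller can no longer tell
the quoted literal apart from the unquoted value.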
Testing:
Ran exhaustive tests and fixed .test files that required modifications.
Will rerun before merging.
Added a couple of tests to exercise edge cases in the test file parser.
Change-Id: I18ddcb0440490ddf8184be66d3681038a1615dd9
Reviewed-on: http://gerrit.cloudera.org:8080/11800
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Tim Armstrong <[email protected]>
> test tpcds-q98.test has some incorrect data
> --------------------------------------------
>
> Key: IMPALA-6374
> URL: https://issues.apache.org/jira/browse/IMPALA-6374
> Project: IMPALA
> Issue Type: Test
> Components: Infrastructure
> Affects Versions: Impala 2.9.0
> Reporter: Stephen Carlin
> Assignee: Tim Armstrong
> Priority: Critical
>
> I happened to look through the unit tests and it looks like tpcds-q98.test
> has some bad data in it, but the test still passes verification.
> One example (among maybe 12 or so) is on line 469:
> line 468: 'AAAAAAAAEKGDAAAA','Houses should ','Books','mystery',1.77,3341.80,1.96
> line 469: 'AAAAAAAAFFDDAAAA',','Books','mystery',2.79,4237.23,2.49
> Note that the 2nd field for line 468 looks normal, but line 469 has just a
> single quote.
> I believe this is happening on all strings that end with a comma for this
> test. The correct result for this line (I believe) should be (note the comma
> after Poor):
> 'AAAAAAAAFFDDAAAA','French, civil hours must report essential values.
> Reasonable, complete judges vary clearly homes; often pleasant women would
> watch. Poor,','Books','mystery',2.79,4237.23,2.48
> My guess as to why this is happening is some code in test_result_verifier.py,
> specifically in the part that says:
> for col_val in row_string.split(','):
>   # This is a bit tricky because we need to handle the case where a comma may be in
>   # the middle of a string. We detect this by finding a split that starts with an
>   # opening string character but that doesn't end in a string character. It is
>   # possible for the first character to be a single-quote, so handle that case
>   if (col_val.startswith("'") and not col_val.endswith("'")) or (col_val == "'"):
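
To make the reporter's guess concrete, the sketch below shows what a
plain split(',') produces for a row whose string ends with a comma.
The row is a shortened, made-up variant of the line 469 example (the
description is abbreviated to 'Poor,'). How the original code went on
to handle these pieces is not shown in the quoted snippet, so this
only illustrates the ambiguity.

```python
# Shortened stand-in for the row quoted in the issue (assumption).
row = "'AAAAAAAAFFDDAAAA','Poor,','Books'"

pieces = row.split(",")
print(pieces)  # ["'AAAAAAAAFFDDAAAA'", "'Poor", "'", "'Books'"]

# The third piece is a lone single quote. It both starts and ends with
# a quote character, so start/end checks alone cannot tell whether it
# opens a new string or closes the previous one.
lone = pieces[2]
print(lone.startswith("'"), lone.endswith("'"))  # True True
```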
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)