[ https://issues.apache.org/jira/browse/CRUNCH-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938186#comment-14938186 ]
mac champion edited comment on CRUNCH-564 at 9/30/15 6:10 PM:
--------------------------------------------------------------

The entry point into this clump of files is really the CSVInputFormat:
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/csv/CSVInputFormat.java#L47

I believe this is what consumers use; they don't access any of the other CSV files directly:
https://github.com/apache/crunch/tree/master/crunch-core/src/main/java/org/apache/crunch/io/text/csv

If I understand correctly, the problem is that when the CSVInputFormat is instantiated, it has no configuration. Later, configure() is called:
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/csv/CSVInputFormat.java#L188-L200

When I wrote the code, I seem to have assumed that configuration.get(OPTION) would always return a blank string if OPTION was not set in the Crunch configuration. That no longer seems to be true. I took a look at the Configuration class and found this:

{code}
public String get(String name) {
  String[] names = handleDeprecation(deprecationContext.get(), name);
  String result = null;
  for(String n : names) {
    result = substituteVars(getProps().getProperty(n));
  }
  return result;
}
{code}

I think the behavior has changed, but I don't really feel like looking too deep into handleDeprecation and substituteVars to figure that out. Honestly, those calls to configuration and the parsing that follows should have been more defensive in the first place.

Edit: Created https://issues.apache.org/jira/browse/CRUNCH-565
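
For illustration only, here is a minimal sketch of what "more defensive" reads could look like. It is not the actual Crunch code or the CRUNCH-565 change. To stay self-contained it uses java.util.Properties as a stand-in for the Hadoop Configuration, since the get() shown above ultimately delegates to getProps().getProperty(n), which returns null for an unset key. The class name DefensiveConfig and the option keys in the usage note are hypothetical.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of Crunch or Hadoop.
final class DefensiveConfig {
    private DefensiveConfig() {}

    /** Returns the first character of the option, or fallback when it is unset or empty. */
    static char getChar(Properties props, String key, char fallback) {
        String raw = props.getProperty(key); // null when the key was never set
        if (raw == null || raw.isEmpty()) {
            return fallback;
        }
        // No trimming here: a delimiter may legitimately be a space or a tab.
        return raw.charAt(0);
    }

    /** Returns the option parsed as an int, or fallback when it is unset, blank, or malformed. */
    static int getInt(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        if (raw == null || raw.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback; // malformed value: fall back instead of failing
        }
    }
}
```

With an empty Properties object, DefensiveConfig.getChar(props, "csv.example.quote", '"') returns the fallback quote character instead of failing on a null. Against a real Hadoop Configuration, the two-argument get(name, defaultValue) overload covers the unset case in a similar way.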

> Add support for using escape character same as open/close quote character
> -------------------------------------------------------------------------
>
>                 Key: CRUNCH-564
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-564
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Muhammad
>            Assignee: Josh Wills
>            Priority: Trivial
>              Labels: csv, csvparser
>
> As a user I would like to use CSVInputFormat to handle CSV files
> following this RFC: http://www.ietf.org/rfc/rfc4180.txt.
> Many developers use the Apache StringEscapeUtils.escapeCsv( ) method to escape
> their CSVs. The method escapes the CSV following RFC 4180.
> https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringEscapeUtils.html
> The CSVLineReader throws an exception in such a case. We can enhance the code to
> support CSVs that use the same character for escape and quote.
> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/csv/CSVLineReader.java#L152
> I would appreciate a comment if someone has knowingly rejected the idea due
> to some technical limitation or a problem with allowing escape and quote to be
> the same character. By the way, Apache HAWQ seems to get around this issue somehow
> and reads such CSVs all right.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)