Kengo Seki created PARQUET-1598:
-----------------------------------

             Summary: Improve error message when convert-csv fails due to an invalid input file name
                 Key: PARQUET-1598
                 URL: https://issues.apache.org/jira/browse/PARQUET-1598
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-cli
            Reporter: Kengo Seki

I ran parquet-cli's {{convert-csv}} without the {{--schema}} option on an input file whose name starts with a digit, and got the following error:

{code}
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main convert-csv 0sample.csv -o sample.parquet
Unknown error
shaded.parquet.org.apache.avro.SchemaParseException: Illegal initial character: 0sample
	at shaded.parquet.org.apache.avro.Schema.validateName(Schema.java:1498)
	at shaded.parquet.org.apache.avro.Schema.access$200(Schema.java:86)
	at shaded.parquet.org.apache.avro.Schema$Name.<init>(Schema.java:645)
	at shaded.parquet.org.apache.avro.Schema.createRecord(Schema.java:182)
	at shaded.parquet.org.apache.avro.SchemaBuilder$RecordBuilder.fields(SchemaBuilder.java:1805)
	at org.apache.parquet.cli.csv.AvroCSV.inferSchemaInternal(AvroCSV.java:158)
	at org.apache.parquet.cli.csv.AvroCSV.inferNullableSchema(AvroCSV.java:78)
	at org.apache.parquet.cli.commands.ConvertCSVCommand.run(ConvertCSVCommand.java:160)
	at org.apache.parquet.cli.Main.run(Main.java:147)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.parquet.cli.Main.main(Main.java:177)
{code}

This happens because {{convert-csv}} uses the input file name (here {{0sample}}, without the extension) as the name of the output schema, while Avro requires a schema name to match the regex pattern {{[A-Za-z_][A-Za-z0-9_]*}}. Users therefore have to rename the input file or pass the {{--schema}} option explicitly, but that is not obvious from the error message.

It'd be nice if the message were improved, or if invalid characters in the schema name were automatically replaced with valid ones to avoid this problem.
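
To illustrate the second suggestion, here is a minimal sketch of how a file-derived name could be checked and rewritten to satisfy the pattern quoted above. The class {{SchemaNames}} and its methods {{isValid}} and {{sanitize}} are hypothetical names chosen for this example; they are not part of parquet-cli or Avro, and replacing invalid characters with an underscore is just one possible policy.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of parquet-cli or Avro.
final class SchemaNames {
  // The schema name pattern quoted in this issue.
  private static final Pattern VALID = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

  private SchemaNames() {}

  // Returns true if the name already matches the required pattern.
  static boolean isValid(String name) {
    return name != null && VALID.matcher(name).matches();
  }

  // Assumed policy: replace every disallowed character with '_' and
  // prefix '_' when the result would start with a digit.
  static String sanitize(String name) {
    if (name == null || name.isEmpty()) {
      return "_";
    }
    String cleaned = name.replaceAll("[^A-Za-z0-9_]", "_");
    char first = cleaned.charAt(0);
    if (first >= '0' && first <= '9') {
      cleaned = "_" + cleaned;
    }
    return cleaned;
  }

  public static void main(String[] args) {
    System.out.println(sanitize("0sample"));    // prints "_0sample"
    System.out.println(sanitize("my-data.v2")); // prints "my_data_v2"
  }
}
```

Alternatively, a check like {{isValid}} could be used only to produce a clearer error message that points the user to renaming the file or passing {{--schema}}, leaving the name itself untouched.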