Kengo Seki created PARQUET-1598:
-----------------------------------
Summary: Improve error message when convert-csv fails due to an invalid input file name
Key: PARQUET-1598
URL: https://issues.apache.org/jira/browse/PARQUET-1598
Project: Parquet
Issue Type: Improvement
Components: parquet-cli
Reporter: Kengo Seki
I ran parquet-cli's {{convert-csv}} without the {{--schema}} option on an input
file whose name starts with a numeric character, and got the following error:
{code}
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main convert-csv 0sample.csv -o sample.parquet
Unknown error
shaded.parquet.org.apache.avro.SchemaParseException: Illegal initial character: 0sample
at shaded.parquet.org.apache.avro.Schema.validateName(Schema.java:1498)
at shaded.parquet.org.apache.avro.Schema.access$200(Schema.java:86)
at shaded.parquet.org.apache.avro.Schema$Name.<init>(Schema.java:645)
at shaded.parquet.org.apache.avro.Schema.createRecord(Schema.java:182)
at shaded.parquet.org.apache.avro.SchemaBuilder$RecordBuilder.fields(SchemaBuilder.java:1805)
at org.apache.parquet.cli.csv.AvroCSV.inferSchemaInternal(AvroCSV.java:158)
at org.apache.parquet.cli.csv.AvroCSV.inferNullableSchema(AvroCSV.java:78)
at org.apache.parquet.cli.commands.ConvertCSVCommand.run(ConvertCSVCommand.java:160)
at org.apache.parquet.cli.Main.run(Main.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.parquet.cli.Main.main(Main.java:177)
{code}
This happens because {{convert-csv}} uses the input file name (here {{0sample}},
from {{0sample.csv}}) as the name of the output schema, while Avro requires
schema names to match the regex pattern {{[A-Za-z_][A-Za-z0-9_]*}}.
So users have to rename the input file or pass the {{--schema}} option
explicitly, but this is not obvious from the error message.
It'd be nice if the message were improved, or if invalid characters in the
schema name were automatically replaced with valid ones to avoid this problem.
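As a rough illustration of the second option, the replacement could look something like the sketch below. The class {{SchemaNames}} and its methods {{isValid}} and {{sanitize}} are hypothetical names made up for this example; they are not part of parquet-cli or Avro. The replacement rule (substitute {{_}} for invalid characters and prefix {{_}} when the name starts with a digit) is only one possible choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not existing parquet-cli or Avro code.
final class SchemaNames {
    // The name pattern quoted in this report.
    private static final Pattern VALID_NAME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    private SchemaNames() {}

    // Returns true if the whole string matches the name pattern.
    static boolean isValid(String name) {
        return name != null && VALID_NAME.matcher(name).matches();
    }

    // Assumed rule: replace every character outside [A-Za-z0-9_] with '_',
    // then prefix '_' if the result starts with a digit.
    static String sanitize(String name) {
        if (name == null || name.isEmpty()) {
            return "_";
        }
        String cleaned = name.replaceAll("[^A-Za-z0-9_]", "_");
        char first = cleaned.charAt(0);
        if (first >= '0' && first <= '9') {
            cleaned = "_" + cleaned;
        }
        return cleaned;
    }

    public static void main(String[] args) {
        for (String n : new String[] {"sample", "0sample", "my-data"}) {
            System.out.println(n + " -> " + sanitize(n)
                + " (valid before: " + isValid(n) + ")");
        }
    }
}
```

With this rule, {{0sample}} would become {{_0sample}}. The same {{isValid}} check could instead be used to fail early with a message that tells the user to rename the file or pass {{--schema}}.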