So I'm having the same case sensitivity issue mentioned in a previous thread: https://groups.google.com/forum/#!topic/parquet-dev/ko-TM2lLpxE
The solution that Christos posted works great, but it didn't work for me when it comes to *partitioned* external tables: either I couldn't read or I couldn't write. All of the data I'm working with is already partitioned in HDFS, so all I need to do is run:

    ALTER TABLE table ADD PARTITION (partitionkey = blah) LOCATION '/path/';

My workaround was to edit the init function in the DataWritableReadSupport class (original: https://github.com/Parquet/parquet-mr/blob/7b0778c490e6782a83663bd5b1ec9d8a7dd7c2ae/parquet-hive/parquet-hive-storage-handler/src/main/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java) so that lower-cased field names are used for the Hive table, and, when the Parquet files are being read, typeListWanted is built with the case-sensitive names so that it properly reads the data I need. With this change I'm able to insert all of my data and run queries on it in Hive.

    if (columns != null) {
      final List<String> listColumns = getColumns(columns);
      /* EDIT - build a map from lowercased field name -> original field name
         in the Parquet file schema */
      final Map<String, String> lowerCaseFileSchemaColumns = new HashMap<String, String>();
      for (ColumnDescriptor c : fileSchema.getColumns()) {
        lowerCaseFileSchemaColumns.put(c.getPath()[0].toLowerCase(), c.getPath()[0]);
      }
      final List<Type> typeListTable = new ArrayList<Type>();
      for (final String col : listColumns) {
        /* EDIT - check whether the Hive column exists in the map, instead of
           whether it exists in the Parquet file schema directly; this is where
           the case sensitivity would normally cause a problem. If it exists,
           get the type information from the Parquet file schema (we need the
           case-sensitive field name to get it). */
        if (lowerCaseFileSchemaColumns.containsKey(col)) {
          typeListTable.add(fileSchema.getType(lowerCaseFileSchemaColumns.get(col)));
        } else {
          typeListTable.add(new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, col));
        }
      }
      MessageType tableSchema = new MessageType(TABLE_SCHEMA, typeListTable);
      contextMetadata.put(HIVE_SCHEMA_KEY, tableSchema.toString());
      MessageType requestedSchemaByUser = tableSchema;
      final List<Integer> indexColumnsWanted = getReadColumnIDs(configuration);
      final List<Type> typeListWanted = new ArrayList<Type>();
      /* EDIT - again we need the case-sensitive field name for getType */
      for (final Integer idx : indexColumnsWanted) {
        typeListWanted.add(tableSchema.getType(lowerCaseFileSchemaColumns.get(listColumns.get(idx))));
      }
      ....

I was wondering whether there are any consequences of doing it this way that I missed, and whether this fix or something similar could someday become a patch.

-- 
*Raymond Lau*
Software Engineer - Intern | [email protected] | (925) 395-3806
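In case it helps anyone reading along, here is a minimal, self-contained sketch of just the lookup idea in isolation (no Hive or Parquet dependencies). The field names "UserId" and "EventTime" are made up for illustration; the point is only that mapping lowercased names back to the case-sensitive originals lets Hive's lowercased column names resolve against a case-sensitive file schema, with a fallback when a column isn't in the file at all:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CaseInsensitiveLookup {

    // Build the lowercased-name -> original-name map, as in the modified init().
    static Map<String, String> buildLowerCaseMap(List<String> fileSchemaColumns) {
        Map<String, String> lowerCaseFileSchemaColumns = new HashMap<String, String>();
        for (String name : fileSchemaColumns) {
            lowerCaseFileSchemaColumns.put(name.toLowerCase(), name);
        }
        return lowerCaseFileSchemaColumns;
    }

    public static void main(String[] args) {
        // Hypothetical case-sensitive field names as stored in the Parquet files.
        List<String> fileSchemaColumns = Arrays.asList("UserId", "EventTime");
        Map<String, String> map = buildLowerCaseMap(fileSchemaColumns);

        // Hive hands us lowercased column names; resolve them back to the
        // case-sensitive names, or note the missing-column fallback case.
        for (String col : new String[] {"userid", "eventtime", "missingcol"}) {
            if (map.containsKey(col)) {
                System.out.println(col + " -> " + map.get(col));
            } else {
                System.out.println(col + " -> not in file (would add an OPTIONAL BINARY placeholder)");
            }
        }
    }
}
```

The same trick would silently merge two file-schema columns that differ only by case, which is one consequence worth keeping in mind.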
