Re: Hive-Parquet case sensitivity

Brock Noland Tue, 29 Jul 2014 18:04:30 -0700

Hi,

Thanks for the message. I am looking at this issue myself.


Which version of Hive are you using, and from which distribution?

Brock
On Jul 29, 2014 1:09 PM, "Raymond Lau" <[email protected]> wrote:

> So I'm having the same case sensitivity issue mentioned in a previous
> thread: https://groups.google.com/forum/#!topic/parquet-dev/ko-TM2lLpxE
>
> The solution that Christos posted works great, but it didn't work for me
> when it comes to *partitioned* external tables: either I couldn't read or I
> couldn't write.  All of the data I'm working with is already partitioned in
> HDFS, so all I need to do is run "ALTER TABLE table ADD PARTITION
> (partitionkey = blah) LOCATION '/path/'".
>
> My workaround was to edit the init function in the
> DataWritableReadSupport class (original:
>
> https://github.com/Parquet/parquet-mr/blob/7b0778c490e6782a83663bd5b1ec9d8a7dd7c2ae/parquet-hive/parquet-hive-storage-handler/src/main/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
> ),
> so that lower-cased field names are used for the Hive table and, when
> the Parquet files are being read, typeListWanted is adjusted so that it
> properly reads the data that I need.  With this change I'm able to insert
> all of my data and run queries on it in Hive.
>
>     if (columns != null) {
>         final List<String> listColumns = getColumns(columns);
>
>         /* EDIT - create a map that maps lowercase field name -> normal
>            field name from the parquet files */
>         final Map<String, String> lowerCaseFileSchemaColumns =
>             new HashMap<String, String>();
>         for (ColumnDescriptor c : fileSchema.getColumns()) {
>             lowerCaseFileSchemaColumns.put(c.getPath()[0].toLowerCase(),
>                 c.getPath()[0]);
>         }
>
>         final List<Type> typeListTable = new ArrayList<Type>();
>         for (final String col : listColumns) {
>             /* EDIT - check if a Hive column field exists in the map,
>                instead of whether it exists in the parquet file schema.
>                this is where the case sensitivity would normally cause a
>                problem.  if it exists, get the type information from the
>                parquet file schema (we need the case sensitive field name
>                to get it) */
>             if (lowerCaseFileSchemaColumns.containsKey(col)) {
>                 typeListTable.add(
>                     fileSchema.getType(lowerCaseFileSchemaColumns.get(col)));
>             } else {
>                 typeListTable.add(new PrimitiveType(Repetition.OPTIONAL,
>                     PrimitiveTypeName.BINARY, col));
>             }
>         }
>
>         MessageType tableSchema = new MessageType(TABLE_SCHEMA,
>             typeListTable);
>         contextMetadata.put(HIVE_SCHEMA_KEY, tableSchema.toString());
>
>         MessageType requestedSchemaByUser = tableSchema;
>         final List<Integer> indexColumnsWanted =
>             getReadColumnIDs(configuration);
>
>         final List<Type> typeListWanted = new ArrayList<Type>();
>
>         /* EDIT - again we need the case sensitive field name for
>            getType */
>         for (final Integer idx : indexColumnsWanted) {
>             typeListWanted.add(tableSchema.getType(
>                 lowerCaseFileSchemaColumns.get(listColumns.get(idx))));
>         }
>
>     ....
>
> I was wondering if there were any consequences of doing it this way that I
> missed and whether this fix or something similar could someday become a
> patch.
>
> --
> *Raymond Lau*
> Software Engineer - Intern |
> [email protected] | (925) 395-3806
>
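
On the question of consequences, two cases in the quoted workaround look worth checking. First, in the last loop, lowerCaseFileSchemaColumns.get(...) returns null for a Hive column that is absent from the Parquet file, and that null is then passed to getType, even though the placeholder type was added to the table schema under the Hive name. Second, two file columns whose names differ only by case would silently overwrite each other in the map. The following is a minimal sketch of the same lookup with both cases handled, using only the Java standard library. The class and method names are hypothetical and are not part of parquet-mr or Hive, and plain strings stand in for the Parquet schema types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of parquet-mr or Hive.
class CaseInsensitiveColumns {

    /**
     * Maps each lower-cased file column name to its original spelling.
     * The same name may appear more than once (the quoted loop visits
     * column descriptors, so a top-level name can repeat); only names that
     * differ solely by case are rejected.
     */
    static Map<String, String> lowerCaseIndex(List<String> fileColumns) {
        Map<String, String> index = new LinkedHashMap<>();
        for (String name : fileColumns) {
            String previous = index.put(name.toLowerCase(Locale.ROOT), name);
            if (previous != null && !previous.equals(name)) {
                throw new IllegalArgumentException(
                    "Columns differ only by case: " + previous + ", " + name);
            }
        }
        return index;
    }

    /**
     * Resolves each Hive column to the file's spelling. A column missing
     * from the file keeps its Hive name, so a later schema lookup never
     * receives null.
     */
    static List<String> resolve(List<String> hiveColumns, Map<String, String> index) {
        List<String> resolved = new ArrayList<>();
        for (String col : hiveColumns) {
            resolved.add(index.getOrDefault(col.toLowerCase(Locale.ROOT), col));
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> index = lowerCaseIndex(Arrays.asList("userId", "eventTime"));
        // "newcol" is not in the file, so it is returned unchanged.
        System.out.println(resolve(Arrays.asList("userid", "eventtime", "newcol"), index));
        // prints [userId, eventTime, newcol]
    }
}
```

In the quoted init function, the equivalent change would be to fall back to the Hive column name whenever the map has no entry for it. Whether rejecting case-only collisions is the right policy is an assumption of this sketch; a real patch might instead pick one column or report the conflict differently.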
