Hello, I updated to the latest versions of everything in the Parquet ecosystem, and the annotations in the message type now come through when reading the Parquet file, so please disregard my previous message.
Question: Can I open a Parquet file with an instance of FSDataInputStream instead of Path?

What I have done was inspired by the CSV to Parquet example on GitHub. We are using Parquet as storage for our proprietary record format. We are also reading existing Parquet files and translating them to our proprietary record format. In short, I open a Parquet file with File or Path, query the footer for the message type and, using the extra information from the annotations, derive our Record schema. I then read the data from Parquet into our record format one record at a time. It is working well: records go in and come out, and the data checks out.

Below is a summary of what I am doing:

1> The message schema:

    Path parquetFilePath = ..
    ParquetMetadata readFooter = ParquetFileReader.readFooter(configuration, parquetFilePath);
    MessageType schema = readFooter.getFileMetaData().getSchema();

2> Then a reader:

    Path path = ..
    GroupReadSupport readSupport = new GroupReadSupport();
    readSupport.init(configuration, null, schema);
    ParquetReader<Group> reader;
    try {
        reader = new ParquetReader<Group>(path, readSupport);
    } catch (IOException e) {
        LOG.error("We can not create Parquet Reader " + e);
        e.printStackTrace();
        throw new ReadParquestFileException(e);
    }

3> Get data sequentially:

    Group group;
    // my record
    Record dmRecord =..
    …
    // is there another group
    if ((group = reader.read()) != null) {
        for (int index = 0; index < MY_RECORD_LENGTH; index++) {
            // stuff with data
            GroupType groupType = group.getType();
            String fieldName = groupType.getFieldName(index);
            Type type = groupType.getType(index);
            if (type.isPrimitive()) {
                PrimitiveType pt = (PrimitiveType) type;
                PrimitiveTypeName ptn = pt.getPrimitiveTypeName();
                // name of the Group getter for this primitive type, e.g. "getLong"
                String method = ptn.getMethod;
                String primitiveName = pt.getName();
                OriginalType originalType = type.getOriginalType();
                switch (method) {
                    case "getBoolean":
                        Boolean valueBoolean = group.getBoolean(index, 0);
                        dmRecord.set(index, valueBoolean);
                        break;
                    case "getFloat":
                        Float valueFloat = group.getFloat(index, 0);
                        dmRecord.set(index, valueFloat);
                        break;
                    case "getDouble":
                        Double valueDouble = group.getDouble(index, 0);
                        String valueToString = group.getValueToString(index, 0);
                        dmRecord.set(index, valueToString);
                        break;
                    case "getLong":
                        Long valueLong = group.getLong(index, 0);
                        dmRecord.set(index, valueLong);
                        break;
                    case "getBinary":
                        Binary valueBinary = group.getBinary(index, 0);
                        LOG.info("value(Binary):" + valueBinary.toString());
                        if (originalType == OriginalType.UTF8) {
                            // UTF8-annotated binary: decode to a String
                            LOG.info("We have a String");
                            byte[] bytes = valueBinary.getBytes();
                            String valueToStringUTF8 = new String(bytes, "UTF-8");
                            dmRecord.set(index, valueToStringUTF8);
                        } else {
                            // no annotation: keep the raw bytes
                            dmRecord.set(index, valueBinary.getBytes());
                        }
                        break;
                    default:
                        // every other primitive type falls back to its string form
                        valueToString = group.getValueToString(index, 0);
                        dmRecord.set(index, valueToString);
                        break;
                }
            }
        }
    }
    return dmRecord;

Question: How do I do this with an FSDataInputStream instead of a Path? It seems like Path is baked in. I have a requirement to work with FSDataInputStream rather than Path and File.

Thank You
Best Regards

--
Daniel St. John
Senior Software Engineer, RedPoint Global Inc.
1515 Walnut Street | Suite 300 | Boulder, CO 80302-5429
C: +719 439 7825
Skype/email: daniel.stj...@redpoint.net
www.redpoint.net

From: "Daniel St. John" <daniel.stj...@redpoint.net>
Date: Thursday, March 12, 2015 at 9:30 PM
To: "dev@parquet.incubator.apache.org" <dev@parquet.incubator.apache.org>
Subject: UTF8 and Parquet

Hello,

Thank You for hearing me. I am creating a translator between Parquet and our proprietary record format here at RedPoint. I create a Parquet file using a message definition to define the schema for the file, like so:

    message m {
      optional int64 id;
      optional binary name (UTF8);
      optional binary address (UTF8);
      …
    }

The UTF8 annotation is accessed through the OriginalType information of the type. The idea is that for a BINARY primitive type I can query the OriginalType information to decide whether to translate the binary to text. However, when I open a Parquet file whose schema was originally annotated with UTF8 specifiers, the schema queried from the footer is missing the OriginalType information. I understand that at the storage level the annotation is meaningless, but at the object model layer it is critical for the proper translation to our types.

Thank You,
Regards
Daniel
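
For reference, the rule described in the message above (decode a BINARY value to text only when it carries the UTF8 annotation, otherwise keep the raw bytes) can be sketched without any Parquet dependency. This is only an illustration under stated assumptions: the class BinaryFieldDecoder, the Annotation enum, and the decode method are hypothetical names made up for this sketch. They stand in for the OriginalType check and are not part of the Parquet API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Parquet: illustrates the decode rule only.
class BinaryFieldDecoder {

    // Stand-in for "has the UTF8 annotation" vs. "has no annotation".
    enum Annotation { UTF8, NONE }

    // Returns a String for UTF8-annotated values, otherwise a copy of the raw bytes.
    static Object decode(byte[] raw, Annotation annotation) {
        if (annotation == Annotation.UTF8) {
            return new String(raw, StandardCharsets.UTF_8);
        }
        return raw.clone();
    }

    public static void main(String[] args) {
        byte[] raw = "Boulder".getBytes(StandardCharsets.UTF_8);

        // Annotated field: the bytes are interpreted as text.
        Object asText = decode(raw, Annotation.UTF8);
        System.out.println(asText);

        // Unannotated field: the bytes are passed through unchanged.
        Object asBytes = decode(raw, Annotation.NONE);
        System.out.println(((byte[]) asBytes).length);
    }
}
```

Using StandardCharsets.UTF_8 rather than the charset name "UTF-8" avoids the checked UnsupportedEncodingException that the String(byte[], String) constructor declares.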