Re: Specifying a projection in Java API
Thanks. I tried this:

    val projection: Seq[column.ColumnDescriptor] = // filter the columns I want from the schema
    val projectionBuilder = Types.buildMessage()
    for (col <- projection) {
      projectionBuilder.addField(Types.buildMessage().named(col.getPath.head))
    }
    r.setRequestedSchema(projectionBuilder.named("tbd"))

This fails when reading the file with "[some_col_name] optional int64 some_col_name is not in the store", where "some_col_name" is not part of my projection. Any idea what I need to do next?

Thanks,

Andy.

On 4/13/18, 12:08 PM, "Ryan Blue" wrote:

> I'd suggest using the Types builders to create your projection schema
> (MessageType), then passing that schema to the
> ParquetFileReader.setRequestedSchema method you found.
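The "is not in the store" error is consistent with the snippet above producing empty group fields: `Types.buildMessage().named(col.getPath.head)` wraps each column name in a new (empty) message builder instead of reusing the column's primitive type. As a hedged sketch, not a verbatim fix from this thread (`fileSchema`, `wantedColumns`, and the column names are assumed), one way to build the projection is to copy the already-typed fields out of the file schema:

```scala
import org.apache.parquet.schema.MessageType
import scala.collection.JavaConverters._

// Assumed: fileSchema is the MessageType read from the footer, r is the
// ParquetFileReader, and wantedColumns holds the projected column names.
val wantedColumns = Set("some_col_a", "some_col_b")
val projectedFields = fileSchema.getFields.asScala
  .filter(field => wantedColumns.contains(field.getName))
  .asJava
// Each kept field retains its original repetition and primitive type,
// so the requested schema matches what is actually in the store.
val projectionType = new MessageType(fileSchema.getName, projectedFields)
r.setRequestedSchema(projectionType)
```

This only covers top-level columns; a nested field would need its full path handled, not just `getPath.head`.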
Re: Specifying a projection in Java API
OK, sorry for all the messages, but I have this working now.

On 4/13/18, 12:59 PM, "Andy Grove" wrote:

> Immediately after sending this I realized that I also needed to pass the
> projection message type in the following lines:
>
>     val columnIO = new ColumnIOFactory().getColumnIO(projectionType)
>     val recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(projectionType))
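Putting the pieces of this thread together, the end-to-end read with a projection might look like the sketch below. This is an assumed reconstruction, not code posted in the thread: `r` and `projectionType` come from the earlier snippets, and the column name passed to `getLong` is hypothetical.

```scala
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.io.ColumnIOFactory

// Assumed: r is an open ParquetFileReader with the projected schema
// already applied via r.setRequestedSchema(projectionType).
var pages = r.readNextRowGroup()
while (pages != null) {
  // Both the ColumnIO and the record converter must be built from the
  // projection, not the full file schema, or field indexes won't line up.
  val columnIO = new ColumnIOFactory().getColumnIO(projectionType)
  val recordReader =
    columnIO.getRecordReader(pages, new GroupRecordConverter(projectionType))
  var i = 0L
  while (i < pages.getRowCount) {
    val group = recordReader.read()
    // Look fields up by name; positional indexes refer to the projection.
    val value = group.getLong("some_col_a", 0) // hypothetical column
    i += 1
  }
  pages = r.readNextRowGroup()
}
```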
Specifying a projection in Java API
Hi,

I’m trying to read a parquet file with a projection from Scala and I can’t find docs or examples for the correct way to do this.

I have the file schema and have filtered for the list of columns I need, so I have a List of ColumnDescriptors.

It looks like I should call ParquetFileReader.setRequestedSchema() but I can’t find an example of constructing the required MessageType parameter.

I’d appreciate any pointers on what to do next.

Thanks,

Andy.
Re: Specifying a projection in Java API
Hi Ryan,

I'm writing some low-level performance tests to try to find a bottleneck on our platform. I have intentionally excluded Spark/Thrift/Presto etc. and want to test Parquet directly, both with local files and against our HDFS cluster, to get performance metrics. Our parquet files were created by Spark and contain schema metadata.

Here is my code for opening the file:

    val footer = ParquetFileReader.open(file, options)
    val schema = footer.getFileMetaData.getSchema
    val r = new ParquetFileReader(file, options)

I can call schema.getColumns and see all of the column definitions.

I have my query working fine, but it is reading all the columns and I want to push down the projection so it only reads the 5 columns I need.

I see that there are some versions of the ParquetFileReader constructors that accept a List[ColumnDescriptor] and I did try that, but ran into errors.

What would you suggest?

Thanks,

Andy.

On 4/13/18, 11:34 AM, "Ryan Blue" wrote:

> Andy, what object model are you using to read? Usually you don't have a
> list of column descriptors, you have an Avro read schema or a Thrift class
> or something.
Re: Specifying a projection in Java API
Andy, what object model are you using to read? Usually you don't have a list of column descriptors, you have an Avro read schema or a Thrift class or something.

On Fri, Apr 13, 2018 at 10:31 AM, Andy Grove wrote:

> Hi,
>
> I’m trying to read a parquet file with a projection from Scala and I can’t
> find docs or examples for the correct way to do this.
>
> I have the file schema and have filtered for the list of columns I need,
> so I have a List of ColumnDescriptors.
>
> It looks like I should call ParquetFileReader.setRequestedSchema() but I
> can’t find an example of constructing the required MessageType parameter.
>
> I’d appreciate any pointers on what to do next.
>
> Thanks,
>
> Andy.

--
Ryan Blue
Software Engineer
Netflix
Re: Specifying a projection in Java API
I'd suggest using the Types builders to create your projection schema (MessageType), then passing that schema to the ParquetFileReader.setRequestedSchema method you found.

On Fri, Apr 13, 2018 at 10:40 AM, Andy Grove wrote:

> Hi Ryan,
>
> I'm writing some low-level performance tests to try and find a bottleneck
> on our platform and have intentionally excluded Spark/Thrift/Presto etc and
> want to test Parquet directly both with local files and against our HDFS
> cluster to get performance metrics. Our parquet files were created by Spark
> and contain schema meta-data.
>
> Here is my code for opening the file:
>
>     val footer = ParquetFileReader.open(file, options)
>     val schema = footer.getFileMetaData.getSchema
>     val r = new ParquetFileReader(file, options)
>
> I can call schema.getColumns and see all of the column definitions.
>
> I have my query working fine but it is reading all the columns and I want
> to push down the projection so it only reads the 5 columns I need.
>
> I see that there are some versions of the ParquetFileReader constructors
> that accept a List[ColumnDescriptor] and I did try that but ran into errors.
>
> What would you suggest?
>
> Thanks,
>
> Andy.

--
Ryan Blue
Software Engineer
Netflix
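As a concrete sketch of the Types-builder approach Ryan describes: the column names and types below are invented for illustration, and `reader` stands in for the `ParquetFileReader` from the earlier snippet — the repetition and primitive type of each field must match the file schema.

```scala
import org.apache.parquet.schema.Types
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.{BINARY, INT64}

// Build a projection MessageType declaring only the columns to read.
// Hypothetical columns: an optional int64 and an optional binary.
val projection = Types.buildMessage()
  .optional(INT64).named("some_col_a")
  .optional(BINARY).named("some_col_b")
  .named("projection")

reader.setRequestedSchema(projection)
```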
Re: Specifying a projection in Java API
Immediately after sending this I realized that I also needed to pass the projection message type in the following lines:

    val columnIO = new ColumnIOFactory().getColumnIO(projectionType)
    val recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(projectionType))

I feel like I am getting close. Current failure is:

    Exception in thread "main" java.lang.RuntimeException: not found 2(my_projected_column) element number 0 in group:
        at org.apache.parquet.example.data.simple.SimpleGroup.getValue(SimpleGroup.java:97)
        at org.apache.parquet.example.data.simple.SimpleGroup.getInteger(SimpleGroup.java:129)
        at org.apache.parquet.example.data.GroupValueSource.getInteger(GroupValueSource.java:39)

On 4/13/18, 12:56 PM, "Andy Grove" wrote:

> Thanks. I tried this.
>
>     val projection: Seq[column.ColumnDescriptor] = // filter the columns I want from the schema
>     val projectionBuilder = Types.buildMessage()
>     for (col <- projection) {
>       projectionBuilder.addField(Types.buildMessage().named(col.getPath.head))
>     }
>     r.setRequestedSchema(projectionBuilder.named("tbd"))
>
> This fails when reading the file with "[some_col_name] optional int64
> some_col_name is not in the store" where "some_col_name" is not part of my
> projection. Any idea what I need to do next?
[jira] [Updated] (PARQUET-1244) Documentation link to logical types broken
[ https://issues.apache.org/jira/browse/PARQUET-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated PARQUET-1244:
------------------------------------
    Labels: beginner  (was: )

> Documentation link to logical types broken
> ------------------------------------------
>
>                 Key: PARQUET-1244
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1244
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Antoine Pitrou
>            Priority: Minor
>              Labels: beginner
>
> The link to {{LogicalTypes.md}} here is broken:
> https://parquet.apache.org/documentation/latest/

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (PARQUET-1269) [C++] Scanning fails with list columns
Antoine Pitrou created PARQUET-1269:
---------------------------------------

             Summary: [C++] Scanning fails with list columns
                 Key: PARQUET-1269
                 URL: https://issues.apache.org/jira/browse/PARQUET-1269
             Project: Parquet
          Issue Type: Bug
          Components: parquet-cpp
            Reporter: Antoine Pitrou

{code:python}
>>> list_arr = pa.array([[1, 2], [3, 4, 5]])
>>> int_arr = pa.array([10, 11])
>>> table = pa.Table.from_arrays([int_arr, list_arr], ['ints', 'lists'])
>>> bio = io.BytesIO()
>>> pq.write_table(table, bio)
>>> bio.seek(0)
0
>>> reader = pq.ParquetReader()
>>> reader.open(bio)
>>> reader.scan_contents()
Traceback (most recent call last):
  File "", line 1, in
    reader.scan_contents()
  File "_parquet.pyx", line 753, in pyarrow._parquet.ParquetReader.scan_contents
  File "error.pxi", line 79, in pyarrow.lib.check_status
ArrowIOError: Parquet error: Total rows among columns do not match
{code}

ScanFileContents() claims it returns the "number of semantic rows" but apparently it actually counts the number of physical elements?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (PARQUET-1270) [C++] Executable tools do not get installed
Antoine Pitrou created PARQUET-1270:
---------------------------------------

             Summary: [C++] Executable tools do not get installed
                 Key: PARQUET-1270
                 URL: https://issues.apache.org/jira/browse/PARQUET-1270
             Project: Parquet
          Issue Type: Bug
          Components: parquet-cpp
            Reporter: Antoine Pitrou

I have the following build script:

{code:bash}
mkdir -p build-debug
pushd build-debug
cmake -DCMAKE_BUILD_TYPE=debug \
      -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME \
      -DPARQUET_BUILD_BENCHMARKS=off \
      -DPARQUET_BUILD_EXECUTABLES=on \
      -DPARQUET_BUILD_TESTS=on \
      ..
make -j16
make install
popd
{code}

parquet_reader does get built:

{code:bash}
$ find -name parquet_reader
./build-debug/debug/parquet_reader
{code}

but it isn't installed:

{code:bash}
$ find $PARQUET_HOME -name parquet_reader
$
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (PARQUET-1270) [C++] Executable tools do not get installed
[ https://issues.apache.org/jira/browse/PARQUET-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437192#comment-16437192 ]

ASF GitHub Bot commented on PARQUET-1270:
-----------------------------------------

pitrou opened a new pull request #455: PARQUET-1270: Install executable tools
URL: https://github.com/apache/parquet-cpp/pull/455

    "parquet_reader" and friends should be installed along with the Parquet libraries.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [C++] Executable tools do not get installed
> -------------------------------------------
>
>                 Key: PARQUET-1270
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1270
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Antoine Pitrou
>            Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (PARQUET-1271) [C++] "parquet_reader" should be "parquet-reader"
Antoine Pitrou created PARQUET-1271:
---------------------------------------

             Summary: [C++] "parquet_reader" should be "parquet-reader"
                 Key: PARQUET-1271
                 URL: https://issues.apache.org/jira/browse/PARQUET-1271
             Project: Parquet
          Issue Type: Wish
          Components: parquet-cpp
            Reporter: Antoine Pitrou

Out of "parquet-dump-schema", "parquet_reader" and "parquet-scan", "parquet_reader" gratuitously follows a different naming convention.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437369#comment-16437369 ]

ASF GitHub Bot commented on PARQUET-968:
----------------------------------------

costimuraru commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet
URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-381155272

    Would be great if we could merge this by Apr 29, when this PR will turn one year :D

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Add Hive/Presto support in ProtoParquet
> ---------------------------------------
>
>                 Key: PARQUET-968
>                 URL: https://issues.apache.org/jira/browse/PARQUET-968
>             Project: Parquet
>          Issue Type: Task
>            Reporter: Constantin Muraru
>            Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437294#comment-16437294 ]

ASF GitHub Bot commented on PARQUET-968:
----------------------------------------

costimuraru commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet
URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-381134753

    @BenoitHanotte sounds awesome! I successfully tested this final patch and it works great with AWS Athena (Presto).

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Add Hive/Presto support in ProtoParquet
> ---------------------------------------
>
>                 Key: PARQUET-968
>                 URL: https://issues.apache.org/jira/browse/PARQUET-968
>             Project: Parquet
>          Issue Type: Task
>            Reporter: Constantin Muraru
>            Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)