Re: Specifying a projection in Java API

2018-04-13 Thread Andy Grove
Thanks. I tried this.

val projection: Seq[column.ColumnDescriptor] = // filter the columns I want from the schema

val projectionBuilder = Types.buildMessage()
for (col <- projection) {
  projectionBuilder.addField(Types.buildMessage().named(col.getPath.head))
}
r.setRequestedSchema(projectionBuilder.named("tbd"))

This fails when reading the file with "[some_col_name] optional int64 some_col_name is not in the store" where "some_col_name" is not part of my projection.

Any idea what I need to do next?

Thanks,

Andy.

On 4/13/18, 12:08 PM, "Ryan Blue"  wrote:

I'd suggest using the Types builders to create your projection schema
(MessageType), then passing that schema to the
ParquetFileReader.setRequestedSchema method you found.


Re: Specifying a projection in Java API

2018-04-13 Thread Andy Grove
OK sorry for all the messages but I have this working now:



Specifying a projection in Java API

2018-04-13 Thread Andy Grove
Hi,

I’m trying to read a parquet file with a projection from Scala and I can’t find 
docs or examples for the correct way to do this.

I have the file schema and have filtered for the list of columns I need, so I 
have a List of ColumnDescriptors.

It looks like I should call ParquetFileReader.setRequestedSchema() but I can’t 
find an example of constructing the required MessageType parameter.

I’d appreciate any pointers on what to do next.

Thanks,

Andy.




Re: Specifying a projection in Java API

2018-04-13 Thread Andy Grove
Hi Ryan,

I'm writing some low-level performance tests to try and find a bottleneck on 
our platform and have intentionally excluded Spark/Thrift/Presto etc and want 
to test Parquet directly both with local files and against our HDFS cluster to 
get performance metrics. Our parquet files were created by Spark and contain 
schema meta-data.

Here is my code for opening the file:

val footer = ParquetFileReader.open(file, options)
val schema = footer.getFileMetaData.getSchema
val r = new ParquetFileReader(file, options)

I can call schema.getColumns and see all of the column definitions.

I have my query working fine but it is reading all the columns and I want to 
push down the projection so it only reads the 5 columns I need.

I see that there are some versions of the ParquetFileReader constructors that 
accept a List[ColumnDescriptor] and I did try that but ran into errors.

What would you suggest?

Thanks,

Andy.


On 4/13/18, 11:34 AM, "Ryan Blue"  wrote:

Andy, what object model are you using to read? Usually you don't have a
list of column descriptors, you have an Avro read schema or a Thrift class
or something.





Re: Specifying a projection in Java API

2018-04-13 Thread Ryan Blue
Andy, what object model are you using to read? Usually you don't have a
list of column descriptors, you have an Avro read schema or a Thrift class
or something.



-- 
Ryan Blue
Software Engineer
Netflix


Re: Specifying a projection in Java API

2018-04-13 Thread Ryan Blue
I'd suggest using the Types builders to create your projection schema
(MessageType), then passing that schema to the
ParquetFileReader.setRequestedSchema method you found.
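
A schematic of that advice in plain Java (a stand-in model with illustrative names, not the parquet-mr API): the projection keeps each selected column's original definition, its primitive type and repetition, rather than rebuilding fields from the name alone, which is what the later error in this thread points at.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in model of a Parquet message schema. With the real API this role is
// played by MessageType, and the projection would be assembled with
// Types.buildMessage().addFields(...).named(...).
public class ProjectionSketch {
    record Field(String name, String primitive, String repetition) {}

    // Copy the selected fields verbatim from the file schema, preserving
    // their type and repetition, in file-schema order.
    static List<Field> project(List<Field> fileSchema, List<String> wanted) {
        List<Field> out = new ArrayList<>();
        for (Field f : fileSchema) {
            if (wanted.contains(f.name())) {
                out.add(f);  // keep the original definition, not just the name
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Field> schema = List.of(
            new Field("id", "int64", "optional"),
            new Field("name", "binary", "optional"),
            new Field("score", "double", "optional"));
        for (Field f : project(schema, List.of("id", "score"))) {
            System.out.println(f.name());  // prints id, then score
        }
    }
}
```

With the real API, each selected field can presumably be looked up in the file schema (parquet-mr's GroupType has a getType lookup by path) and added to the message builder unchanged.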



-- 
Ryan Blue
Software Engineer
Netflix


Re: Specifying a projection in Java API

2018-04-13 Thread Andy Grove
Immediately after sending this I realized that I also needed to pass the projection message type in the following lines:

  val columnIO = new ColumnIOFactory().getColumnIO(projectionType)

  val recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(projectionType))

I feel like I am getting close. Current failure is:

Exception in thread "main" java.lang.RuntimeException: not found 2(my_projected_column) element number 0 in group:

    at org.apache.parquet.example.data.simple.SimpleGroup.getValue(SimpleGroup.java:97)
    at org.apache.parquet.example.data.simple.SimpleGroup.getInteger(SimpleGroup.java:129)
    at org.apache.parquet.example.data.GroupValueSource.getInteger(GroupValueSource.java:39)
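
One possible reading of that exception (an assumption, not confirmed in the thread): the group is being accessed with a field index taken from the original file schema, while a group materialized from the projection has its own, smaller index space. A minimal plain-Java illustration:

```java
import java.util.List;

// Hypothetical illustration of the suspected mismatch: a column's index in
// the projected schema differs from its index in the original file schema,
// and the materialized group only knows the projected fields.
public class FieldIndexSketch {
    static final List<String> FILE_SCHEMA = List.of("id", "name", "my_projected_column");
    static final List<String> PROJECTION = List.of("my_projected_column");

    public static void main(String[] args) {
        int originalIndex = FILE_SCHEMA.indexOf("my_projected_column");   // 2
        int projectedIndex = PROJECTION.indexOf("my_projected_column");   // 0
        // Asking the projected group for field 2 fails: it holds a single
        // field at index 0, which matches the "2(my_projected_column)
        // element number 0" wording of the exception above.
        System.out.println(originalIndex + " vs " + projectedIndex);
    }
}
```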







[jira] [Updated] (PARQUET-1244) Documentation link to logical types broken

2018-04-13 Thread Antoine Pitrou (JIRA)

 [ https://issues.apache.org/jira/browse/PARQUET-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated PARQUET-1244:

Labels: beginner  (was: )

> Documentation link to logical types broken
> --
>
> Key: PARQUET-1244
> URL: https://issues.apache.org/jira/browse/PARQUET-1244
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Priority: Minor
>  Labels: beginner
>
> The link to {{LogicalTypes.md}} here is broken:
> https://parquet.apache.org/documentation/latest/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1269) [C++] Scanning fails with list columns

2018-04-13 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created PARQUET-1269:
---

 Summary: [C++] Scanning fails with list columns
 Key: PARQUET-1269
 URL: https://issues.apache.org/jira/browse/PARQUET-1269
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Antoine Pitrou


{code:python}
>>> list_arr = pa.array([[1, 2], [3, 4, 5]])
>>> int_arr = pa.array([10, 11])
>>> table = pa.Table.from_arrays([int_arr, list_arr], ['ints', 'lists'])
>>> bio = io.BytesIO()
>>> pq.write_table(table, bio)
>>> bio.seek(0)
0
>>> reader = pq.ParquetReader()
>>> reader.open(bio)
>>> reader.scan_contents()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
reader.scan_contents()
  File "_parquet.pyx", line 753, in pyarrow._parquet.ParquetReader.scan_contents
  File "error.pxi", line 79, in pyarrow.lib.check_status
ArrowIOError: Parquet error: Total rows among columns do not match
{code}

ScanFileContents() claims it returns the "number of semantic rows" but 
apparently it actually counts the number of physical elements?
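
A plain-Java model of the data in the report above (an assumption about the cause, not a trace through parquet-cpp): the table has 2 semantic rows, but the list column carries 5 leaf values.

```java
import java.util.List;

// The example table: 2 rows, where the 'lists' column holds a variable
// number of leaf values per row.
public class RowCountSketch {
    static final List<List<Integer>> LISTS = List.of(List.of(1, 2), List.of(3, 4, 5));
    static final List<Integer> INTS = List.of(10, 11);

    static int semanticRows() { return LISTS.size(); }
    static int leafValues() { return LISTS.stream().mapToInt(List::size).sum(); }

    public static void main(String[] args) {
        // If a scan counts leaf values per column, 'ints' reports 2 while
        // 'lists' reports 5, consistent with the error
        // "Total rows among columns do not match".
        System.out.println(semanticRows() + " rows, " + leafValues() + " leaf values");
    }
}
```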





[jira] [Created] (PARQUET-1270) [C++] Executable tools do not get installed

2018-04-13 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created PARQUET-1270:
---

 Summary: [C++] Executable tools do not get installed
 Key: PARQUET-1270
 URL: https://issues.apache.org/jira/browse/PARQUET-1270
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Antoine Pitrou


I have the following build script:
{code:bash}
mkdir -p build-debug
pushd build-debug

cmake -DCMAKE_BUILD_TYPE=debug \
  -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME \
  -DPARQUET_BUILD_BENCHMARKS=off \
  -DPARQUET_BUILD_EXECUTABLES=on \
  -DPARQUET_BUILD_TESTS=on \
  ..

make -j16
make install
popd
{code}

parquet_reader does get built:
{code:bash}
$ find -name parquet_reader
./build-debug/debug/parquet_reader
{code}

but it isn't installed:
{code:bash}
$ find $PARQUET_HOME -name parquet_reader
$
{code}
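
A sketch of the usual remedy, assuming target names based on the tool names mentioned elsewhere in this digest; the actual CMakeLists.txt in parquet-cpp may differ:

{code}
# Hypothetical: install the tool executables alongside the libraries.
if(PARQUET_BUILD_EXECUTABLES)
  install(TARGETS parquet_reader parquet-scan parquet-dump-schema
          RUNTIME DESTINATION bin)
endif()
{code}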






[jira] [Commented] (PARQUET-1270) [C++] Executable tools do not get installed

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437192#comment-16437192 ]

ASF GitHub Bot commented on PARQUET-1270:
-

pitrou opened a new pull request #455: PARQUET-1270: Install executable tools
URL: https://github.com/apache/parquet-cpp/pull/455
 
 
   "parquet_reader" and friends should be installed along with the Parquet 
libraries.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Executable tools do not get installed
> ---
>
> Key: PARQUET-1270
> URL: https://issues.apache.org/jira/browse/PARQUET-1270
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>





[jira] [Created] (PARQUET-1271) [C++] "parquet_reader" should be "parquet-reader"

2018-04-13 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created PARQUET-1271:
---

 Summary: [C++] "parquet_reader" should be "parquet-reader"
 Key: PARQUET-1271
 URL: https://issues.apache.org/jira/browse/PARQUET-1271
 Project: Parquet
  Issue Type: Wish
  Components: parquet-cpp
Reporter: Antoine Pitrou


Out of "parquet-dump-schema", "parquet_reader" and "parquet-scan", 
"parquet_reader" gratuitously follows a different naming convention.





[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437369#comment-16437369 ]

ASF GitHub Bot commented on PARQUET-968:


costimuraru commented on issue #411: PARQUET-968 Add Hive/Presto support in 
ProtoParquet
URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-381155272
 
 
   Would be great if we could merge this by Apr 29, when this PR will turn one 
year :D




> Add Hive/Presto support in ProtoParquet
> ---
>
> Key: PARQUET-968
> URL: https://issues.apache.org/jira/browse/PARQUET-968
> Project: Parquet
>  Issue Type: Task
>Reporter: Constantin Muraru
>Priority: Major
>






[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437294#comment-16437294 ]

ASF GitHub Bot commented on PARQUET-968:


costimuraru commented on issue #411: PARQUET-968 Add Hive/Presto support in 
ProtoParquet
URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-381134753
 
 
   @BenoitHanotte sounds awesome! I successfully tested this final patch and it 
works great with AWS Athena (Presto).




> Add Hive/Presto support in ProtoParquet
> ---
>
> Key: PARQUET-968
> URL: https://issues.apache.org/jira/browse/PARQUET-968
> Project: Parquet
>  Issue Type: Task
>Reporter: Constantin Muraru
>Priority: Major
>



