Re: plan for Go implementation of Plasma

2018-12-19 Thread Kouhei Sutou
Hi,

The GObject Plasma bindings mentioned by Philipp are the official
C bindings for Plasma (Plasma GLib):

  https://github.com/apache/arrow/tree/master/c_glib/plasma-glib

The Ruby bindings use it, so we'll maintain and improve it.


There was an example of generating Go bindings for Arrow from
Arrow GLib automatically:

  https://github.com/apache/arrow/tree/apache-arrow-0.9.0/c_glib/example/go

I removed it because we started native Go bindings. The same
mechanism can be used for Plasma but I don't think we should
use it.

We'll be able to implement Go bindings with cgo and Plasma
GLib. I think that Go bindings for Plasma shouldn't export
any GLib-related API when we use Plasma GLib.
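
For example, a minimal sketch of such bindings could look like the
following. (The pkg-config name, header path, and gplasma_*/GPlasma*
identifiers are assumptions for illustration; the real Plasma GLib API
may differ.)

package plasma

/*
#cgo pkg-config: plasma-glib
#include <stdlib.h>
#include <plasma-glib/plasma-glib.h>
*/
import "C"

import (
    "errors"
    "unsafe"
)

// Client hides the underlying GLib object from callers.
type Client struct {
    ptr *C.GPlasmaClient // assumed GLib type name
}

// Connect connects to a Plasma store socket.
func Connect(socketPath string) (*Client, error) {
    cPath := C.CString(socketPath)
    defer C.free(unsafe.Pointer(cPath))

    var gerr *C.GError
    client := C.gplasma_client_new((*C.gchar)(cPath), &gerr) // assumed constructor
    if client == nil {
        defer C.g_error_free(gerr)
        return nil, errors.New(C.GoString((*C.char)(gerr.message)))
    }
    return &Client{ptr: client}, nil
}

// Close releases the underlying GLib object.
func (c *Client) Close() {
    C.g_object_unref(C.gpointer(unsafe.Pointer(c.ptr)))
}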

I can help when we use Plasma GLib.


Thanks,
--
kou

In 
  "Re: plan for Go implementation of Plasma" on Wed, 19 Dec 2018 17:23:58 -0500,
  Dustin Long  wrote:

> Neat! Thank you for the suggestions, I'll take a look into these other
> approaches. Sticking with cgo does sound promising; I had dismissed it due
> to needing to maintain a C interface, but if there are already some bindings
> that might become official, that negates that issue.
> 
> On Wed, Dec 19, 2018 at 3:26 PM Philipp Moritz  wrote:
> 
>> Hey Dustin,
>>
>> Thanks for getting in touch! Here are two additional ways to do it:
>>
>> 5. Native go client library: If Go has support to ship file descriptors
>> over unix domain sockets (which I think it has, see
>>
>> https://github.com/opencontainers/runc/blob/master/libcontainer/utils/cmsg.go
>> )
>> and interact with memory mapped files it might also be possible to make a
>> version of
>> https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc that
>> is native go. The advantage is that it wouldn't need any additional
>> compilation steps on the go side, the disadvantage is that it would need to
>> be updated if the plasma client internals change (like they did recently
>> with the removal of the release buffer).
>>
>> 6. GObject wrapper: Possibly one could use the GObject plasma bindings
>> that kou and his team are managing to build a wrapper (not sure how
>> feasible that is or if there is a mature GObject go implementation).
>>
>> I would encourage you to start by writing down the ideal Go API for
>> the client and then see how it can be implemented after that (to make sure
>> the API, which is the most important piece, is not influenced by the
>> implementation choice).
>>
>> Then, going the cgo route seems the most promising to me, since that is, I
>> think, the route by which most go code interfaces with native libraries. There
>> are some C bindings that have been written:
>> https://github.com/plures/pxnd/tree/master/libplasma. If they are useful
>> to
>> you, we can make a plan to integrate them into the repo.
>>
>> Best,
>> Philipp.
>>
>>
>>
>> On Wed, Dec 19, 2018 at 12:04 PM Dustin Long  wrote:
>>
>> > Hi all!
>> >
>> > I am a developer on qri , a data-science tool built on
>> > IPFS written in go. We're interested in integrating Arrow and especially
>> > Plasma, in order to be able to share datasets with other apps like
>> Jupyter
>> > Notebook. Having this functionality is going to be key for how we plan to
>> > integrate with existing frameworks.
>> >
>> > I've been investigating possible approaches for how to use Plasma in our
>> > codebase. I realize that Plasma is still a work in progress, and doesn't
>> > have a stable API yet, but we're also a ways off from being ready to fully
>> > integrate it on our side. Just figured it would be good to start this
>> > conversation early in order to plan ahead for how development should
>> > proceed.
>> >
>> > So, the prototypes I've been hacking on have revealed a few choices of
>> how
>> > to make our golang codebase call Plasma's C++, and I wanted to see what
>> the
>> > Plasma devs think about these approaches or if they have any preference
>> for
>> > how the go bindings should behave.
>> >
>> > Here are the options in order of what seems to be least to most usable:
>> >
>> > 1. cgo
>> >   Use go's builtin cgo facility to call the Plasma C++ implementation.
>> cgo
>> > is relatively easy to use; however, it can only call C functions. So this
>> > would require writing and maintaining a pure C language wrapper around
>> the
>> > C++ functionality we want to expose. A lot would be lost in translation
>> and
>> > the resulting go code would look nothing like the original library.
>> >
>> > 2. dlopen
>> >   Install Plasma as a library on the user's system, then load the library
>> > at run-time, looking up function calls and data structures as needed.
>> > Removes the need for a static dependency, but still requires a lot of
>> shim
>> > code to be written to load the shared library calls. C++'s name mangling
>> > gets in the way a lot.
>> >
>> > 3. Swig
>> >   Wrap a swig interface file that exposes whatever functionality we want
>> to
>> > golang. The standard go tool has builtin swig support, which knows when
>> to
>> > invoke the swig 

[jira] [Created] (ARROW-4085) [GLib] Use "field" for struct data type

2018-12-19 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-4085:
---

 Summary: [GLib] Use "field" for struct data type
 Key: ARROW-4085
 URL: https://issues.apache.org/jira/browse/ARROW-4085
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


Because the C++ API was changed to use "field" by ARROW-3545.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Dictionary with repeated values?

2018-12-19 Thread Wes McKinney
The way that dictionary encoding is implemented in C++ (with
DictionaryType, DictionaryArray) is a construct particular to the
library.

At the protocol level, dictionary encoding is a property of a field at
some level of the schema tree [1].

The dictionary itself is a record batch with a single field/column [2]

So based on the protocol there is no requirement for uniqueness in the
dictionary. I would say it would be preferable for implementations to
avoid constructing dictionaries with duplicates, though.

- Wes

[1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L226
[2]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L71

On Wed, Dec 19, 2018 at 5:51 PM Ben Kietzman  wrote:
>
> Is it legal to create a DictionaryType whose dictionary has repeated
> values?


Dictionary with repeated values?

2018-12-19 Thread Ben Kietzman
Is it legal to create a DictionaryType whose dictionary has repeated
values?


Re: plan for Go implementation of Plasma

2018-12-19 Thread Dustin Long
Neat! Thank you for the suggestions, I'll take a look into these other
approaches. Sticking with cgo does sound promising; I had dismissed it due
to needing to maintain a C interface, but if there are already some bindings
that might become official, that negates that issue.

On Wed, Dec 19, 2018 at 3:26 PM Philipp Moritz  wrote:

> Hey Dustin,
>
> Thanks for getting in touch! Here are two additional ways to do it:
>
> 5. Native go client library: If Go has support to ship file descriptors
> over unix domain sockets (which I think it has, see
>
> https://github.com/opencontainers/runc/blob/master/libcontainer/utils/cmsg.go
> )
> and interact with memory mapped files it might also be possible to make a
> version of
> https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc that
> is native go. The advantage is that it wouldn't need any additional
> compilation steps on the go side, the disadvantage is that it would need to
> be updated if the plasma client internals change (like they did recently
> with the removal of the release buffer).
>
> 6. GObject wrapper: Possibly one could use the GObject plasma bindings
> that kou and his team are managing to build a wrapper (not sure how
> feasible that is or if there is a mature GObject go implementation).
>
> I would encourage you to start by writing down the ideal Go API for
> the client and then see how it can be implemented after that (to make sure
> the API, which is the most important piece, is not influenced by the
> implementation choice).
>
> Then, going the cgo route seems the most promising to me, since that is, I
> think, the route by which most go code interfaces with native libraries. There
> are some C bindings that have been written:
> https://github.com/plures/pxnd/tree/master/libplasma. If they are useful
> to
> you, we can make a plan to integrate them into the repo.
>
> Best,
> Philipp.
>
>
>
> On Wed, Dec 19, 2018 at 12:04 PM Dustin Long  wrote:
>
> > Hi all!
> >
> > I am a developer on qri , a data-science tool built on
> > IPFS written in go. We're interested in integrating Arrow and especially
> > Plasma, in order to be able to share datasets with other apps like
> Jupyter
> > Notebook. Having this functionality is going to be key for how we plan to
> > integrate with existing frameworks.
> >
> > I've been investigating possible approaches for how to use Plasma in our
> > codebase. I realize that Plasma is still a work in progress, and doesn't
> > have a stable API yet, but we're also a ways off from being ready to fully
> > integrate it on our side. Just figured it would be good to start this
> > conversation early in order to plan ahead for how development should
> > proceed.
> >
> > So, the prototypes I've been hacking on have revealed a few choices of
> how
> > to make our golang codebase call Plasma's C++, and I wanted to see what
> the
> > Plasma devs think about these approaches or if they have any preference
> for
> > how the go bindings should behave.
> >
> > Here are the options in order of what seems to be least to most usable:
> >
> > 1. cgo
> >   Use go's builtin cgo facility to call the Plasma C++ implementation.
> cgo
> > is relatively easy to use; however, it can only call C functions. So this
> > would require writing and maintaining a pure C language wrapper around
> the
> > C++ functionality we want to expose. A lot would be lost in translation
> and
> > the resulting go code would look nothing like the original library.
> >
> > 2. dlopen
> >   Install Plasma as a library on the user's system, then load the library
> > at run-time, looking up function calls and data structures as needed.
> > Removes the need for a static dependency, but still requires a lot of
> shim
> > code to be written to load the shared library calls. C++'s name mangling
> > gets in the way a lot.
> >
> > 3. Swig
> >   Wrap a swig interface file that exposes whatever functionality we want
> to
> > golang. The standard go tool has builtin swig support, which knows when
> to
> > invoke the swig generator, in order to create go bindings that resemble
> the
> > C++ original. The build process is relatively uninterrupted.
> >
> > I noticed there doesn't seem to be any swig in use currently in the arrow
> > codebase, which made me think there might have been a reason that it has
> > been avoided for other languages. I'm interested to hear any thoughts, or
> > see if there are other suggestions on how to proceed?
> >
> > Regards,
> > Dustin
> >
>


Re: plan for Go implementation of Plasma

2018-12-19 Thread Philipp Moritz
Hey Dustin,

Thanks for getting in touch! Here are two additional ways to do it:

5. Native go client library: If Go has support to ship file descriptors
over unix domain sockets (which I think it has, see
https://github.com/opencontainers/runc/blob/master/libcontainer/utils/cmsg.go)
and interact with memory mapped files it might also be possible to make a
version of
https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc that
is native go. The advantage is that it wouldn't need any additional
compilation steps on the go side, the disadvantage is that it would need to
be updated if the plasma client internals change (like they did recently
with the removal of the release buffer).

6. GObject wrapper: Possibly one could use the GObject plasma bindings
that kou and his team are managing to build a wrapper (not sure how
feasible that is or if there is a mature GObject go implementation).
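
To make option 5 above a bit more concrete, here is a minimal sketch of
just the fd-passing piece in plain Go (standard library only; the socket
path, message layout, and mapping size are placeholders, and the real
Plasma wire protocol is more involved than this):

package main

import (
    "fmt"
    "net"
    "syscall"
)

// recvFD reads one message from the connection and extracts a file
// descriptor passed via SCM_RIGHTS.
func recvFD(conn *net.UnixConn) (int, error) {
    buf := make([]byte, 1)
    oob := make([]byte, syscall.CmsgSpace(4)) // room for a single fd

    _, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
    if err != nil {
        return -1, err
    }
    msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
    if err != nil {
        return -1, err
    }
    if len(msgs) == 0 {
        return -1, fmt.Errorf("no control message received")
    }
    fds, err := syscall.ParseUnixRights(&msgs[0])
    if err != nil {
        return -1, err
    }
    return fds[0], nil
}

func main() {
    // "/tmp/plasma" is a placeholder for the store socket path.
    addr, err := net.ResolveUnixAddr("unix", "/tmp/plasma")
    if err != nil {
        panic(err)
    }
    conn, err := net.DialUnix("unix", nil, addr)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    fd, err := recvFD(conn)
    if err != nil {
        panic(err)
    }

    // Map the store's shared memory read-only; 4096 is a placeholder size
    // (a real client learns the size from the store's reply).
    data, err := syscall.Mmap(fd, 0, 4096, syscall.PROT_READ, syscall.MAP_SHARED)
    if err != nil {
        panic(err)
    }
    defer syscall.Munmap(data)

    fmt.Println("mapped", len(data), "bytes of plasma store memory")
}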

I would encourage you to start by writing down the ideal Go API for
the client and then see how it can be implemented after that (to make sure
the API, which is the most important piece, is not influenced by the
implementation choice).

Then, going the cgo route seems the most promising to me, since that is, I
think, the route by which most go code interfaces with native libraries. There
are some C bindings that have been written:
https://github.com/plures/pxnd/tree/master/libplasma. If they are useful to
you, we can make a plan to integrate them into the repo.

Best,
Philipp.



On Wed, Dec 19, 2018 at 12:04 PM Dustin Long  wrote:

> Hi all!
>
> I am a developer on qri , a data-science tool built on
> IPFS written in go. We're interested in integrating Arrow and especially
> Plasma, in order to be able to share datasets with other apps like Jupyter
> Notebook. Having this functionality is going to be key for how we plan to
> integrate with existing frameworks.
>
> I've been investigating possible approaches for how to use Plasma in our
> codebase. I realize that Plasma is still a work in progress, and doesn't
> have a stable API yet, but we're also a ways off from being ready to fully
> integrate it on our side. Just figured it would be good to start this
> conversation early in order to plan ahead for how development should
> proceed.
>
> So, the prototypes I've been hacking on have revealed a few choices of how
> to make our golang codebase call Plasma's C++, and I wanted to see what the
> Plasma devs think about these approaches or if they have any preference for
> how the go bindings should behave.
>
> Here are the options in order of what seems to be least to most usable:
>
> 1. cgo
>   Use go's builtin cgo facility to call the Plasma C++ implementation. cgo
> is relatively easy to use; however, it can only call C functions. So this
> would require writing and maintaining a pure C language wrapper around the
> C++ functionality we want to expose. A lot would be lost in translation and
> the resulting go code would look nothing like the original library.
>
> 2. dlopen
>   Install Plasma as a library on the user's system, then load the library
> at run-time, looking up function calls and data structures as needed.
> Removes the need for a static dependency, but still requires a lot of shim
> code to be written to load the shared library calls. C++'s name mangling
> gets in the way a lot.
>
> 3. Swig
>   Wrap a swig interface file that exposes whatever functionality we want to
> golang. The standard go tool has builtin swig support, which knows when to
> invoke the swig generator, in order to create go bindings that resemble the
> C++ original. The build process is relatively uninterrupted.
>
> I noticed there doesn't seem to be any swig in use currently in the arrow
> codebase, which made me think there might have been a reason that it has
> been avoided for other languages. I'm interested to hear any thoughts, or
> see if there are other suggestions on how to proceed?
>
> Regards,
> Dustin
>


plan for Go implementation of Plasma

2018-12-19 Thread Dustin Long
Hi all!

I am a developer on qri , a data-science tool built on
IPFS written in go. We're interested in integrating Arrow and especially
Plasma, in order to be able to share datasets with other apps like Jupyter
Notebook. Having this functionality is going to be key for how we plan to
integrate with existing frameworks.

I've been investigating possible approaches for how to use Plasma in our
codebase. I realize that Plasma is still a work in progress, and doesn't
have a stable API yet, but we're also a ways off from being ready to fully
integrate it on our side. Just figured it would be good to start this
conversation early in order to plan ahead for how development should
proceed.

So, the prototypes I've been hacking on have revealed a few choices of how
to make our golang codebase call Plasma's C++, and I wanted to see what the
Plasma devs think about these approaches or if they have any preference for
how the go bindings should behave.

Here are the options in order of what seems to be least to most usable:

1. cgo
  Use go's builtin cgo facility to call the Plasma C++ implementation. cgo
is relatively easy to use; however, it can only call C functions. So this
would require writing and maintaining a pure C language wrapper around the
C++ functionality we want to expose. A lot would be lost in translation and
the resulting go code would look nothing like the original library.

2. dlopen
  Install Plasma as a library on the user's system, then load the library
at run-time, looking up function calls and data structures as needed.
Removes the need for a static dependency, but still requires a lot of shim
code to be written to load the shared library calls. C++'s name mangling
gets in the way a lot.

3. Swig
  Wrap a swig interface file that exposes whatever functionality we want to
golang. The standard go tool has builtin swig support, which knows when to
invoke the swig generator, in order to create go bindings that resemble the
C++ original. The build process is relatively uninterrupted.
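
To make option 1 concrete, below is a rough sketch of the shape such a
wrapper could take. The plasma_shim_* names are invented and the C side
is stubbed inline purely so the sketch is self-contained; a real shim
would be a separate C file that calls into the C++ plasma::PlasmaClient:

package plasma

/*
#include <stdlib.h>

// Stub C shim, inlined only to keep this sketch self-contained. A real
// shim would live in plasma_shim.h / plasma_shim.cc and call into the
// C++ Plasma client.
typedef struct { int connected; } plasma_shim_client;

static plasma_shim_client* plasma_shim_connect(const char* socket_path) {
    plasma_shim_client* c = malloc(sizeof(plasma_shim_client));
    c->connected = socket_path != NULL;
    return c;
}

static void plasma_shim_disconnect(plasma_shim_client* c) { free(c); }
*/
import "C"

import (
    "errors"
    "unsafe"
)

// Client is the Go-facing handle; callers never see any C types.
type Client struct {
    c *C.plasma_shim_client
}

// Connect opens a connection to the Plasma store socket.
func Connect(socketPath string) (*Client, error) {
    cs := C.CString(socketPath)
    defer C.free(unsafe.Pointer(cs))

    c := C.plasma_shim_connect(cs)
    if c == nil {
        return nil, errors.New("plasma: connect failed")
    }
    return &Client{c: c}, nil
}

// Close tears down the connection and frees the C-side handle.
func (cl *Client) Close() {
    C.plasma_shim_disconnect(cl.c)
}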

I noticed there doesn't seem to be any swig in use currently in the arrow
codebase, which made me think there might have been a reason that it has
been avoided for other languages. I'm interested to hear any thoughts, or
see if there are other suggestions on how to proceed?

Regards,
Dustin


[jira] [Created] (ARROW-4084) Simplify Status and stringstream boilerplate

2018-12-19 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-4084:
-

 Summary: Simplify Status and stringstream boilerplate
 Key: ARROW-4084
 URL: https://issues.apache.org/jira/browse/ARROW-4084
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Francois Saint-Jacques
Assignee: Francois Saint-Jacques


There's a lot of stringstream repetition when creating a Status.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Alberto Ramón
Some answers / ideas:

The typical one: write to Kafka
The fashionable one: Pravega (from Apache Flink)
The future: wait for Erasure Coding in HDFS 3

On Wed, 19 Dec 2018 at 16:41, Wes McKinney  wrote:

> We could certainly develop some tools in C++ and/or Python to assist
> with the compaction workflows. If you have an idea about how these
> might look and be generally useful, please feel free to propose in a
> JIRA issue
>
> On Wed, Dec 19, 2018 at 9:09 AM Joel Pfaff  wrote:
> >
> > Unfortunately I cannot use kudu in my projects, I would have loved to
> give
> > it a try. I did not know about hudi, it seems very similar to what we do
> > (Parquet + Avro), I will have a look.
> > I am following the iceberg project very closely, because it appears to
> > solve a lot of problems that we face on a regular basis.
> > I am really excited to learn that the arrow and iceberg projects could
> work
> > together and I can hope for a lot of good things coming out of these.
> >
> > On Wed, Dec 19, 2018 at 2:52 PM Uwe L. Korn  wrote:
> >
> > > This can also be solved by using a table format like
> > > https://github.com/uber/hudi or
> > > https://github.com/apache/incubator-iceberg where the latter has a PR
> > > open for a basic Python implementation with pyarrow.
> > >
> > > These table formats support using Avro and Parquet seamlessly together
> > > without the reader needing to take care of the storage format.
> > >
> > > Uwe
> > >
> > > > Am 19.12.2018 um 14:47 schrieb Wes McKinney :
> > > >
> > > > This turns out to be a very common problem (landing incremental
> > > > updates, dealing with compaction and small files). It's part of the
> > > > reason that systems like Apache Kudu were developed, e.g.
> > > >
> > > >
> > >
> https://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
> > > >
> > > > If you have to use file storage, then figuring out a scheme to
> compact
> > > > Parquet files (e.g. once per hour, once per day) will definitely be
> > > > worth it compared with using a slower file format (like Avro)
> > > >
> > > > - Wes
> > > >
> > > >> On Wed, Dec 19, 2018 at 7:37 AM Joel Pfaff 
> > > wrote:
> > > >>
> > > >> Hello,
> > > >>
> > > >> For my company's usecases, we have found that the number of files
> was a
> > > >> critical part of the time spent doing the execution plan, so we
> found
> > > the
> > > >> idea of very regularly writing small parquet files to be rather
> > > inefficient.
> > > >>
> > > >> There are some formats that support an `append` semantic (I have
> tested
> > > >> successfully with avro, but there are a couple others that could be
> used
> > > >> similarly).
> > > >> So we had a few cases where we were aggregating data in a `current
> > > table`
> > > >> in set of avro files, and rewriting all of it in few parquet files
> at
> > > the
> > > >> end of the day.
> > > >> This allowed us to have files that have been prepared to optimize
> their
> > > >> querying performance (file size, row group size, sorting per
> column) by
> > > >> maximizing the ability to benefit from the statistics.
> > > >> And our queries were doing an UNION between "optimized for speed"
> > > history
> > > >> tables and "optimized for latency" current tables, when the query
> > > timeframe
> > > >> was crossing the boundaries of the current day.
> > > >>
> > > >> Regards, Joel
> > > >>
> > > >> On Wed, Dec 19, 2018 at 2:14 PM Francois Saint-Jacques <
> > > >> fsaintjacq...@networkdump.com> wrote:
> > > >>
> > > >>> Hello Darren,
> > > >>>
> > > >>> what Uwe suggests is usually the way to go, your active process
> writes
> > > to a
> > > >>> new file every time. Then you have a parallel process/thread that
> does
> > > >>> compaction of smaller files in the background such that you don't
> have
> > > too
> > > >>> many files.
> > > >>>
> > >  On Wed, Dec 19, 2018 at 7:59 AM Uwe L. Korn 
> wrote:
> > > 
> > >  Hello Darren,
> > > 
> > >  you're out of luck here. Parquet files are immutable and meant for
> > > batch
> > >  writes. Once they're written you cannot modify them anymore. To
> load
> > > >>> them,
> > >  you need to know their metadata which is in the footer. The
> footer is
> > >  always at the end of the file and written once you call close.
> > > 
> > >  Your use case is normally fulfilled by continously starting new
> files
> > > and
> > >  reading them back in using the ParquetDataset class
> > > 
> > >  Cheers
> > >  Uwe
> > > 
> > >  Am 18.12.2018 um 21:03 schrieb Darren Gallagher  >:
> > > 
> > > >> [Cross posted from https://github.com/apache/arrow/issues/3203]
> > > >>
> > > >> I'm adding new data to a parquet file every 60 seconds using
> this
> > > >>> code:
> > > >>
> > > >> import os
> > > >> import json
> > > >> import time
> > > >> import requests
> > > >> import pandas as pd
> > > >> import numpy as np
> > > >> import pyarrow as pa
> > > >> i

[jira] [Created] (ARROW-4083) [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense Array (of the dictionary type)

2018-12-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4083:
---

 Summary: [C++] Allowing ChunkedArrays to contain a mix of 
DictionaryArray and dense Array (of the dictionary type)
 Key: ARROW-4083
 URL: https://issues.apache.org/jira/browse/ARROW-4083
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.13.0


In some applications we may receive a stream of some dictionary encoded data 
followed by some non-dictionary encoded data. For example this happens in 
Parquet files when the dictionary reaches a certain configurable size threshold.

We should think about how we can model this in our in-memory data structures, 
and how it can flow through to relevant computational components (i.e. certain 
data flow observers -- like an Aggregation -- might need to be able to process 
either a dense or dictionary encoded version of a particular array in the same 
stream)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4082) [C++] CMake tweaks: allow RelWithDebInfo, improve FindClangTools

2018-12-19 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-4082:


 Summary: [C++] CMake tweaks: allow RelWithDebInfo, improve 
FindClangTools
 Key: ARROW-4082
 URL: https://issues.apache.org/jira/browse/ARROW-4082
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman


SetupCxxFlags.cmake does not list "RELWITHDEBINFO" in the [final flag 
setup|https://github.com/apache/arrow/blob/master/cpp/cmake_modules/SetupCxxFlags.cmake#L363],
 so cmake will error out if that build config is selected. It's handy for quick 
debugging without switching your python build etc over to "DEBUG".

FindClangTools.cmake could check the version of 'clang-format' (no version 
suffix) to see if it satisfies a version requirement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4081) Sum methods on Mac OS X panic when the array is empty

2018-12-19 Thread Jonathan A Sternberg (JIRA)
Jonathan A Sternberg created ARROW-4081:
---

 Summary: Sum methods on Mac OS X panic when the array is empty
 Key: ARROW-4081
 URL: https://issues.apache.org/jira/browse/ARROW-4081
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Jonathan A Sternberg


If you create an empty array and use the `Sum` methods in the math package for 
the Go version, they will panic with an `index out of range` error.

(Will edit with a reproducer)
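
A sketch of the kind of reproducer meant here (the import paths and the
array/math APIs below are assumptions based on the current Go Arrow
packages):

package main

import (
    "fmt"

    "github.com/apache/arrow/go/arrow/array"
    "github.com/apache/arrow/go/arrow/math"
    "github.com/apache/arrow/go/arrow/memory"
)

func main() {
    pool := memory.NewGoAllocator()

    // Build an empty float64 array.
    b := array.NewFloat64Builder(pool)
    defer b.Release()
    arr := b.NewFloat64Array()
    defer arr.Release()

    // Summing the empty array triggers the "index out of range" panic
    // on Mac OS X.
    fmt.Println(math.Float64.Sum(arr))
}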



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4080) [Rust] Improving lengthy build times in Appveyor

2018-12-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4080:
---

 Summary: [Rust] Improving lengthy build times in Appveyor
 Key: ARROW-4080
 URL: https://issues.apache.org/jira/browse/ARROW-4080
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Wes McKinney
 Fix For: 0.13.0


The Rust build is now the longest-running build in Appveyor, by 7-10 minutes,
perhaps inflated further because of the Parquet donation

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/21118767

Would it be possible to split this into two builds, or otherwise improve the 
runtime?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Wes McKinney
We could certainly develop some tools in C++ and/or Python to assist
with the compaction workflows. If you have an idea about how these
might look and be generally useful, please feel free to propose in a
JIRA issue

On Wed, Dec 19, 2018 at 9:09 AM Joel Pfaff  wrote:
>
> Unfortunately I cannot use kudu in my projects, I would have loved to give
> it a try. I did not know about hudi, it seems very similar to what we do
> (Parquet + Avro), I will have a look.
> I am following the iceberg project very closely, because it appears to
> solve a lot of problems that we face on a regular basis.
> I am really excited to learn that the arrow and iceberg projects could work
> together and I can hope for a lot of good things coming out of these.
>
> On Wed, Dec 19, 2018 at 2:52 PM Uwe L. Korn  wrote:
>
> > This can also be solved by using a table format like
> > https://github.com/uber/hudi or
> > https://github.com/apache/incubator-iceberg where the latter has a PR
> > open for a basic Python implementation with pyarrow.
> >
> > These table formats support using Avro and Parquet seamlessly together
> > without the reader needing to take care of the storage format.
> >
> > Uwe
> >
> > > Am 19.12.2018 um 14:47 schrieb Wes McKinney :
> > >
> > > This turns out to be a very common problem (landing incremental
> > > updates, dealing with compaction and small files). It's part of the
> > > reason that systems like Apache Kudu were developed, e.g.
> > >
> > >
> > https://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
> > >
> > > If you have to use file storage, then figuring out a scheme to compact
> > > Parquet files (e.g. once per hour, once per day) will definitely be
> > > worth it compared with using a slower file format (like Avro)
> > >
> > > - Wes
> > >
> > >> On Wed, Dec 19, 2018 at 7:37 AM Joel Pfaff 
> > wrote:
> > >>
> > >> Hello,
> > >>
> > >> For my company's usecases, we have found that the number of files was a
> > >> critical part of the time spent doing the execution plan, so we found
> > the
> > >> idea of very regularly writing small parquet files to be rather
> > inefficient.
> > >>
> > >> There are some formats that support an `append` semantic (I have tested
> > >> successfully with avro, but there are a couple others that could be used
> > >> similarly).
> > >> So we had a few cases where we were aggregating data in a `current
> > table`
> > >> in set of avro files, and rewriting all of it in few parquet files at
> > the
> > >> end of the day.
> > >> This allowed us to have files that have been prepared to optimize their
> > >> querying performance (file size, row group size, sorting per column) by
> > >> maximizing the ability to benefit from the statistics.
> > >> And our queries were doing an UNION between "optimized for speed"
> > history
> > >> tables and "optimized for latency" current tables, when the query
> > timeframe
> > >> was crossing the boundaries of the current day.
> > >>
> > >> Regards, Joel
> > >>
> > >> On Wed, Dec 19, 2018 at 2:14 PM Francois Saint-Jacques <
> > >> fsaintjacq...@networkdump.com> wrote:
> > >>
> > >>> Hello Darren,
> > >>>
> > >>> what Uwe suggests is usually the way to go, your active process writes
> > to a
> > >>> new file every time. Then you have a parallel process/thread that does
> > >>> compaction of smaller files in the background such that you don't have
> > too
> > >>> many files.
> > >>>
> >  On Wed, Dec 19, 2018 at 7:59 AM Uwe L. Korn  wrote:
> > 
> >  Hello Darren,
> > 
> >  you're out of luck here. Parquet files are immutable and meant for
> > batch
> >  writes. Once they're written you cannot modify them anymore. To load
> > >>> them,
> >  you need to know their metadata which is in the footer. The footer is
> >  always at the end of the file and written once you call close.
> > 
> >  Your use case is normally fulfilled by continuously starting new files
> > and
> >  reading them back in using the ParquetDataset class
> > 
> >  Cheers
> >  Uwe
> > 
> >  Am 18.12.2018 um 21:03 schrieb Darren Gallagher :
> > 
> > >> [Cross posted from https://github.com/apache/arrow/issues/3203]
> > >>
> > >> I'm adding new data to a parquet file every 60 seconds using this
> > >>> code:
> > >>
> > >> import os
> > >> import json
> > >> import time
> > >> import requests
> > >> import pandas as pd
> > >> import numpy as np
> > >> import pyarrow as pa
> > >> import pyarrow.parquet as pq
> > >>
> > >> api_url = 'https://opensky-network.org/api/states/all'
> > >>
> > >> cols = ['icao24', 'callsign', 'origin', 'time_position',
> > >>   'last_contact', 'longitude', 'latitude',
> > >>   'baro_altitude', 'on_ground', 'velocity', 'true_track',
> > >>   'vertical_rate', 'sensors', 'geo_altitude', 'squawk',
> > >>   'spi', 'position_source']
> > >>
> > >> def get_ne

[jira] [Created] (ARROW-4079) [C++] Add machine benchmarks

2018-12-19 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-4079:
-

 Summary: [C++] Add machine benchmarks
 Key: ARROW-4079
 URL: https://issues.apache.org/jira/browse/ARROW-4079
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Affects Versions: 0.11.1
Reporter: Antoine Pitrou


I wonder if it may be useful to add machine benchmarks. I have a cache/memory 
latency benchmark lying around; we could also add e.g. memory bandwidth 
benchmarks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Joel Pfaff
Unfortunately I cannot use kudu in my projects, I would have loved to give
it a try. I did not know about hudi, it seems very similar to what we do
(Parquet + Avro), I will have a look.
I am following the iceberg project very closely, because it appears to
solve a lot of problems that we face on a regular basis.
I am really excited to learn that the arrow and iceberg projects could work
together and I can hope for a lot of good things coming out of these.

On Wed, Dec 19, 2018 at 2:52 PM Uwe L. Korn  wrote:

> This can also be solved by using a table format like
> https://github.com/uber/hudi or
> https://github.com/apache/incubator-iceberg where the latter has a PR
> open for a basic Python implementation with pyarrow.
>
> These table formats support using Avro and Parquet seamlessly together
> without the reader needing to take care of the storage format.
>
> Uwe
>
> > Am 19.12.2018 um 14:47 schrieb Wes McKinney :
> >
> > This turns out to be a very common problem (landing incremental
> > updates, dealing with compaction and small files). It's part of the
> > reason that systems like Apache Kudu were developed, e.g.
> >
> >
> https://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
> >
> > If you have to use file storage, then figuring out a scheme to compact
> > Parquet files (e.g. once per hour, once per day) will definitely be
> > worth it compared with using a slower file format (like Avro)
> >
> > - Wes
> >
> >> On Wed, Dec 19, 2018 at 7:37 AM Joel Pfaff 
> wrote:
> >>
> >> Hello,
> >>
> >> For my company's usecases, we have found that the number of files was a
> >> critical part of the time spent doing the execution plan, so we found
> the
> >> idea of very regularly writing small parquet files to be rather
> inefficient.
> >>
> >> There are some formats that support an `append` semantic (I have tested
> >> successfully with avro, but there are a couple others that could be used
> >> similarly).
> >> So we had a few cases where we were aggregating data in a `current
> table`
> >> in set of avro files, and rewriting all of it in few parquet files at
> the
> >> end of the day.
> >> This allowed us to have files that have been prepared to optimize their
> >> querying performance (file size, row group size, sorting per column) by
> >> maximizing the ability to benefit from the statistics.
> >> And our queries were doing an UNION between "optimized for speed"
> history
> >> tables and "optimized for latency" current tables, when the query
> timeframe
> >> was crossing the boundaries of the current day.
> >>
> >> Regards, Joel
> >>
> >> On Wed, Dec 19, 2018 at 2:14 PM Francois Saint-Jacques <
> >> fsaintjacq...@networkdump.com> wrote:
> >>
> >>> Hello Darren,
> >>>
> >>> what Uwe suggests is usually the way to go, your active process writes
> to a
> >>> new file every time. Then you have a parallel process/thread that does
> >>> compaction of smaller files in the background such that you don't have
> too
> >>> many files.
> >>>
>  On Wed, Dec 19, 2018 at 7:59 AM Uwe L. Korn  wrote:
> 
>  Hello Darren,
> 
>  you're out of luck here. Parquet files are immutable and meant for
> batch
>  writes. Once they're written you cannot modify them anymore. To load
> >>> them,
>  you need to know their metadata which is in the footer. The footer is
>  always at the end of the file and written once you call close.
> 
>  Your use case is normally fulfilled by continuously starting new files
> and
>  reading them back in using the ParquetDataset class
> 
>  Cheers
>  Uwe
> 
>  Am 18.12.2018 um 21:03 schrieb Darren Gallagher :
> 
> >> [Cross posted from https://github.com/apache/arrow/issues/3203]
> >>
> >> I'm adding new data to a parquet file every 60 seconds using this
> >>> code:
> >>
> >> import os
> >> import json
> >> import time
> >> import requests
> >> import pandas as pd
> >> import numpy as np
> >> import pyarrow as pa
> >> import pyarrow.parquet as pq
> >>
> >> api_url = 'https://opensky-network.org/api/states/all'
> >>
> >> cols = ['icao24', 'callsign', 'origin', 'time_position',
> >>   'last_contact', 'longitude', 'latitude',
> >>   'baro_altitude', 'on_ground', 'velocity', 'true_track',
> >>   'vertical_rate', 'sensors', 'geo_altitude', 'squawk',
> >>   'spi', 'position_source']
> >>
> >> def get_new_flight_info(writer):
> >>   print("Requesting new data")
> >>   req = requests.get(api_url)
> >>   content = req.json()
> >>
> >>   states = content['states']
> >>   df = pd.DataFrame(states, columns = cols)
> >>   df['timestamp'] = content['time']
> >>   print("Found {} new items".format(len(df)))
> >>
> >>   table = pa.Table.from_pandas(df)
> >>   if writer is None:
> >>   writer = pq.ParquetWriter('openskyflights.parquet',
> >>> table.schema)
> >>

[jira] [Created] (ARROW-4078) [CI] Need separate doc building job

2018-12-19 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-4078:
-

 Summary: [CI] Need separate doc building job
 Key: ARROW-4078
 URL: https://issues.apache.org/jira/browse/ARROW-4078
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Documentation
Reporter: Antoine Pitrou


When only changes to the {{docs}} directory are made, most Travis jobs are 
skipped, even the Python job which (presumably) builds the documentation to 
check for errors etc.

We should probably have a separate doc building job, or perhaps make it part of 
the linting job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-19 Thread Krisztián Szűcs
We now have a nightly docs build:
https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93&query=docs
If we decide where to upload it, we can publish nightly dev docs.

On Wed, Dec 19, 2018 at 3:12 PM Wes McKinney  wrote:

> Indeed. I had opened an issue about this some time ago
>
> https://issues.apache.org/jira/browse/ARROW-1299
>
> On Wed, Dec 19, 2018 at 8:10 AM Antoine Pitrou  wrote:
> >
> >
> > Le 19/12/2018 à 15:07, Wes McKinney a écrit :
> > > +1 also. The C++ README has grown quite long, for example. Probably best
> > > to put all of that in the Sphinx project.
> > >
> > > One downside of Sphinx is that some things can grow out of date on the
> > > website in between releases. Within the codebase itself, we can remedy
> > > this by directing people to the .rst files rather than the website
> >
> > Ideally we would provide a "dev" (i.e. git master) doc build in addition
> > to the doc build for the latest release.
> >
> > Regards
> >
> > Antoine.
>


Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-19 Thread Wes McKinney
Indeed. I had opened an issue about this some time ago

https://issues.apache.org/jira/browse/ARROW-1299

On Wed, Dec 19, 2018 at 8:10 AM Antoine Pitrou  wrote:
>
>
> Le 19/12/2018 à 15:07, Wes McKinney a écrit :
> > +1 also. The C++ README has grown quite long, for example. Probably best
> > to put all of that in the Sphinx project.
> >
> > One downside of Sphinx is that some things can grow out of date on the
> > website in between releases. Within the codebase itself, we can remedy
> > this by directing people to the .rst files rather than the website
>
> Ideally we would provide a "dev" (i.e. git master) doc build in addition
> to the doc build for the latest release.
>
> Regards
>
> Antoine.


Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-19 Thread Antoine Pitrou


Le 19/12/2018 à 15:07, Wes McKinney a écrit :
> +1 also. The C++ README has grown quite long, for example. Probably best
> to put all of that in the Sphinx project.
> 
> One downside of Sphinx is that some things can grow out of date on the
> website in between releases. Within the codebase itself, we can remedy
> this by directing people to the .rst files rather than the website

Ideally we would provide a "dev" (i.e. git master) doc build in addition
to the doc build for the latest release.

Regards

Antoine.


Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-19 Thread Wes McKinney
+1 also. The C++ README has grown quite long, for example. Probably best
to put all of that in the Sphinx project.

One downside of Sphinx is that some things can grow out of date on the
website in between releases. Within the codebase itself, we can remedy
this by directing people to the .rst files rather than the website

On Wed, Dec 19, 2018 at 5:36 AM Uwe L. Korn  wrote:
>
> +1, I would also like to see them in Sphinx.
>
> Uwe
>
> > Am 19.12.2018 um 11:13 schrieb Antoine Pitrou :
> >
> >
> > We should decide where we want to put developer docs.
> >
> > I would favour putting them in the Sphinx docs, personally.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >> Le 19/12/2018 à 02:20, Wes McKinney a écrit :
> >> Some projects have a REVIEWERS.md file
> >>
> >> https://github.com/apache/parquet-mr/blob/master/parquet-common/REVIEWERS.md
> >>
> >> We could do the same, or keep the file on the project wiki so it's
> >> lighter-weight to change (no pull request required)
> >>
> >> https://cwiki.apache.org/confluence/display/ARROW
> >>
> >> +1 for adding labels to PRs in any case. We use the [COMPONENT] naming
> >> in the title so people can set up e-mail filters (the GitHub labels
> >> don't come through in their e-mail notification AFAICT)
> >>
> >>> On Tue, Dec 18, 2018 at 1:10 AM Chao Sun  wrote:
> >>>
> >>> +1 on adding labels for languages, review states, components, etc. This
> >>> makes it much easier to filter PRs.
> >>>
> >>> Chao
> >>>
> >>> On Wed, Dec 12, 2018 at 11:54 AM Krisztián Szűcs 
> >>> 
> >>> wrote:
> >>>
>  Create a new one and set arrow-xxx as parent:
>  [image: image.png]
> 
> > On Wed, Dec 12, 2018 at 7:46 PM Antoine Pitrou  
> > wrote:
> >
> >
> > Apparently it's possible to create GitHub teams inside the Apache
> > organization ourselves.  I've just created a dummy one:
> > https://github.com/orgs/apache/teams/arrow-xxx/members
> >
> > However, I cannot create a child team inside of the arrow-committers
> > team.  The button "Add a team" here is grayed out:
> > https://github.com/orgs/apache/teams/arrow-committers/teams
> >
> > Regards
> >
> > Antoine.
> >
> >
> >> Le 12/12/2018 à 19:40, Krisztián Szűcs a écrit :
> >> I like the GitHub teams approach. Do We need to ask INFRA to create
> > them?
> >>
> >>> On Wed, Dec 12, 2018, 7:28 PM Sebastien Binet  >>>
> >>> On Wed, Dec 12, 2018 at 7:25 PM Antoine Pitrou 
> > wrote:
> >>>
> 
>  Hi,
> 
>  Now that we have a lot of different implementations and a growing
> > number
>  of assorted topics, it becomes hard to know whether a PR or issue has
> > a
>  dedicated expert or would benefit from an outsider look.
> 
>  In Python we have what we call the "experts" list which is a 
>  per-topic
>  (or per-library module) contributors who are generally interested in
> > and
>  competent on such topic (*).  So it's possible to cc such a person, 
>  or
>  if no expert is available on a given topic, perhaps for someone else
> > to
>  try and have a look anyway.  Perhaps we need something similar for
> > Arrow?
> 
> >>>
> >>> with github, one can also create "teams" and "@" them.
> >>> we could perhaps create @arrow-py, @arrow-cxx, @arrow-go, ...
> >>> this dilutes a bit responsibilities but also reduces a bit the net
> > that's
> >>> cast.
> >>>
> >>> -s
> >>>
> >>>
>  (*) https://devguide.python.org/experts/
> 
>  Regards
> 
>  Antoine.
> 
> 
> 
> > Le 12/12/2018 à 19:13, Ravindra Pindikura a écrit :
> > Attendees : Wes, Sidd, Bryan, Francois, Hatem, Nick, Shyam, 
> > Ravindra,
>  Matt
> >
> > Wes:
> > - do not rush the 0.12 release before the holidays, instead target
> > the
>  release for early next year
> > - request everyone to look at PRs in the queue, and help by doing
> >>> reviews
> >
> > Wes/Nick
> > - queried about Interest in developing a "dataset abstraction" as a
>  layer above file readers that arrow now supports (parquet, csv, json)
> >
> > Sidd
> > - agreed to be the release manager for 0.12
> > - things to keep in mind for release managers :
> > 1. We now use crossbow to automate the building of binaries with CI
> > 2. From this release, the binary artifacts will be hosted in bintray
>  instead of apache dist since the size has increased significantly
> >
> > Hatem
> > - Asked about documentation regarding IDE for setup/debug of arrow
>  libraries
> > - Wes pointed out the developer wiki on confluence. Hatem offered to
>  help with documentation.
> >
> > Than

Re: Arrow pull requests: please limit squashing your commits

2018-12-19 Thread Wes McKinney
On Wed, Dec 19, 2018 at 7:47 AM Antoine Pitrou  wrote:
>
>
> Le 19/12/2018 à 14:42, Wes McKinney a écrit :
> >
> > * Our PR merge tool (dev/merge_arrow_py.py) squashes all the commits
> > anyway, so squashing twice is redundant
>
> The problem is you can then get spurious conflicts if you base a PR on
> another.  Happened to me several times.

Agreed -- I didn't say "never squash" but to "avoid" or "limit" it.
The stacked PR use case is a good example where things can be painful
if all your commits are not atomic. This does not describe the average
pull request, though

>
> Regards
>
> Antoine.


Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Uwe L. Korn
This can also be solved by using a table format like 
https://github.com/uber/hudi or https://github.com/apache/incubator-iceberg 
where the latter has a PR open for a basic Python implementation with pyarrow. 

These table formats support using Avro and Parquet seamlessly together without 
the reader needing to take care of the storage format.

Uwe

> Am 19.12.2018 um 14:47 schrieb Wes McKinney :
> 
> This turns out to be a very common problem (landing incremental
> updates, dealing with compaction and small files). It's part of the
> reason that systems like Apache Kudu were developed, e.g.
> 
> https://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
> 
> If you have to use file storage, then figuring out a scheme to compact
> Parquet files (e.g. once per hour, once per day) will definitely be
> worth it compared with using a slower file format (like Avro)
> 
> - Wes
> 
>> On Wed, Dec 19, 2018 at 7:37 AM Joel Pfaff  wrote:
>> 
>> Hello,
>> 
>> For my company's usecases, we have found that the number of files was a
>> critical part of the time spent doing the execution plan, so we found the
>> idea of very regularly writing small parquet files to be rather inefficient.
>> 
>> There are some formats that support an `append` semantic (I have tested
>> successfully with avro, but there are a couple others that could be used
>> similarly).
>> So we had a few cases where we were aggregating data in a `current table`
>> in set of avro files, and rewriting all of it in few parquet files at the
>> end of the day.
>> This allowed us to have files that have been prepared to optimize their
>> querying performance (file size, row group size, sorting per column) by
>> maximizing the ability to benefit from the statistics.
>> And our queries were doing an UNION between "optimized for speed" history
>> tables and "optimized for latency" current tables, when the query timeframe
>> was crossing the boundaries of the current day.
>> 
>> Regards, Joel
>> 
>> On Wed, Dec 19, 2018 at 2:14 PM Francois Saint-Jacques <
>> fsaintjacq...@networkdump.com> wrote:
>> 
>>> Hello Darren,
>>> 
>>> what Uwe suggests is usually the way to go, your active process writes to a
>>> new file every time. Then you have a parallel process/thread that does
>>> compaction of smaller files in the background such that you don't have too
>>> many files.
>>> 
 On Wed, Dec 19, 2018 at 7:59 AM Uwe L. Korn  wrote:
 
 Hello Darren,
 
 you're out of luck here. Parquet files are immutable and meant for batch
 writes. Once they're written you cannot modify them anymore. To load
>>> them,
 you need to know their metadata which is in the footer. The footer is
 always at the end of the file and written once you call close.
 
 Your use case is normally fulfilled by continuously starting new files and
 reading them back in using the ParquetDataset class
 
 Cheers
 Uwe
 
 Am 18.12.2018 um 21:03 schrieb Darren Gallagher :
 
>> [Cross posted from https://github.com/apache/arrow/issues/3203]
>> 
>> I'm adding new data to a parquet file every 60 seconds using this
>>> code:
>> 
>> import os
>> import json
>> import time
>> import requests
>> import pandas as pd
>> import numpy as np
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>> 
>> api_url = 'https://opensky-network.org/api/states/all'
>> 
>> cols = ['icao24', 'callsign', 'origin', 'time_position',
>>   'last_contact', 'longitude', 'latitude',
>>   'baro_altitude', 'on_ground', 'velocity', 'true_track',
>>   'vertical_rate', 'sensors', 'geo_altitude', 'squawk',
>>   'spi', 'position_source']
>> 
>> def get_new_flight_info(writer):
>>   print("Requesting new data")
>>   req = requests.get(api_url)
>>   content = req.json()
>> 
>>   states = content['states']
>>   df = pd.DataFrame(states, columns = cols)
>>   df['timestamp'] = content['time']
>>   print("Found {} new items".format(len(df)))
>> 
>>   table = pa.Table.from_pandas(df)
>>   if writer is None:
>>   writer = pq.ParquetWriter('openskyflights.parquet',
>>> table.schema)
>>   writer.write_table(table=table)
>>   return writer
>> 
>> if __name__ == '__main__':
>>   writer = None
>>   while (not os.path.exists('opensky.STOP')):
>>   writer = get_new_flight_info(writer)
>>   time.sleep(60)
>> 
>>   if writer:
>>   writer.close()
>> 
>> This is working fine and the file grows every 60 seconds.
>> However unless I force the loop to exit I am unable to use the parquet
>> file. In a separate terminal I try to access the parquet file using
>>> this
>> code:
>> 
>> import pandas as pd
>> import pyarrow.parquet as pq
>> 
>> table = pq.read_table("openskyflights.parquet")
>> df = table.to_pandas()
>>>

Re: Arrow pull requests: please limit squashing your commits

2018-12-19 Thread Wes McKinney
On Wed, Dec 19, 2018 at 7:47 AM Francois Saint-Jacques
 wrote:
>
> No issue with this.
>
> When the final squash is done, which title/body is preserved?

The PR title (in GitHub) and the PR description are what matter. The
commit messages don't really matter

>
> On Wed, Dec 19, 2018 at 8:43 AM Wes McKinney  wrote:
>
> > hi folks,
> >
> > As the contributor base has grown, our development styles have grown
> > increasingly diverse.
> >
> > Sometimes contributors are used to working in a Gerrit-style workflow
> > where patches are always squashed with `git rebase -i` into a single
> > patch, and then force pushed to the PR branch.
> >
> > I'd like to ask you to avoid doing this, as it can make things harder
> > for maintainers. Let me explain:
> >
> > * When you rebase and force-push, GitHub fails to generate an e-mail
> > notification. I use the GitHub notifications to tell which branches
> > are being actively developed and may need to be reviewed again. Many
> > times now I have thought a branch was inactive only to look more
> > closely and see that it's been updated via force-push. Since it took
> > GitHub 10 years to start showing force push changes at all in their UI
> > I'm not holding out for them to send e-mail notifications about this
> >
> > * GitHub is not Gerrit. We don't have the awesome incremental diff
> > feature. So in lieu of this it's easier to be able to look at
> > incremental diffs (e.g. responses to code review comments) by clicking
> > on the individual commits
> >
> > * Our PR merge tool (dev/merge_arrow_py.py) squashes all the commits
> > anyway, so squashing twice is redundant
> >
> > Sometimes I'll have commits like this in my branch
> >
> > * IMPLEMENTING THE FEATURE
> > * lint
> > * fixing CI
> > * fixing toolchain issue
> > * code review commits
> > * fixing CI issues
> > * more code review comments
> > * documentation
> >
> > I think it's fine to combine some of the commits like this, so long as the
> > produced commits reflect the logical evolution of your patch, for the
> > purposes of code review.
> >
> > In the event of a gnarly rebase on master, sometimes it is easiest to
> > create a single commit and then rebase that.
> >
> > Thanks,
> > Wes
> >


Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Wes McKinney
This turns out to be a very common problem (landing incremental
updates, dealing with compaction and small files). It's part of the
reason that systems like Apache Kudu were developed, e.g.

https://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/

If you have to use file storage, then figuring out a scheme to compact
Parquet files (e.g. once per hour, once per day) will definitely be
worth it compared with using a slower file format (like Avro)

- Wes

On Wed, Dec 19, 2018 at 7:37 AM Joel Pfaff  wrote:
>
> Hello,
>
> For my company's usecases, we have found that the number of files was a
> critical part of the time spent doing the execution plan, so we found the
> idea of very regularly writing small parquet files to be rather inefficient.
>
> There are some formats that support an `append` semantic (I have tested
> successfully with avro, but there are a couple others that could be used
> similarly).
> So we had a few cases where we were aggregating data in a `current table`
> in set of avro files, and rewriting all of it in few parquet files at the
> end of the day.
> This allowed us to have files that have been prepared to optimize their
> querying performance (file size, row group size, sorting per column) by
> maximizing the ability to benefit from the statistics.
> And our queries were doing an UNION between "optimized for speed" history
> tables and "optimized for latency" current tables, when the query timeframe
> was crossing the boundaries of the current day.
>
> Regards, Joel
>
> On Wed, Dec 19, 2018 at 2:14 PM Francois Saint-Jacques <
> fsaintjacq...@networkdump.com> wrote:
>
> > Hello Darren,
> >
> > what Uwe suggests is usually the way to go, your active process writes to a
> > new file every time. Then you have a parallel process/thread that does
> > compaction of smaller files in the background such that you don't have too
> > many files.
> >
> > On Wed, Dec 19, 2018 at 7:59 AM Uwe L. Korn  wrote:
> >
> > > Hello Darren,
> > >
> > > you're out of luck here. Parquet files are immutable and meant for batch
> > > writes. Once they're written you cannot modify them anymore. To load
> > them,
> > > you need to know their metadata which is in the footer. The footer is
> > > always at the end of the file and written once you call close.
> > >
> > > Your use case is normally fulfilled by continuously starting new files and
> > > reading them back in using the ParquetDataset class
> > >
> > > Cheers
> > > Uwe
> > >
> > > Am 18.12.2018 um 21:03 schrieb Darren Gallagher :
> > >
> > > >> [Cross posted from https://github.com/apache/arrow/issues/3203]
> > > >>
> > > >> I'm adding new data to a parquet file every 60 seconds using this
> > code:
> > > >>
> > > >> import os
> > > >> import json
> > > >> import time
> > > >> import requests
> > > >> import pandas as pd
> > > >> import numpy as np
> > > >> import pyarrow as pa
> > > >> import pyarrow.parquet as pq
> > > >>
> > > >> api_url = 'https://opensky-network.org/api/states/all'
> > > >>
> > > >> cols = ['icao24', 'callsign', 'origin', 'time_position',
> > > >>         'last_contact', 'longitude', 'latitude',
> > > >>         'baro_altitude', 'on_ground', 'velocity', 'true_track',
> > > >>         'vertical_rate', 'sensors', 'geo_altitude', 'squawk',
> > > >>         'spi', 'position_source']
> > > >>
> > > >> def get_new_flight_info(writer):
> > > >>     print("Requesting new data")
> > > >>     req = requests.get(api_url)
> > > >>     content = req.json()
> > > >>
> > > >>     states = content['states']
> > > >>     df = pd.DataFrame(states, columns = cols)
> > > >>     df['timestamp'] = content['time']
> > > >>     print("Found {} new items".format(len(df)))
> > > >>
> > > >>     table = pa.Table.from_pandas(df)
> > > >>     if writer is None:
> > > >>         writer = pq.ParquetWriter('openskyflights.parquet', table.schema)
> > > >>     writer.write_table(table=table)
> > > >>     return writer
> > > >>
> > > >> if __name__ == '__main__':
> > > >>     writer = None
> > > >>     while (not os.path.exists('opensky.STOP')):
> > > >>         writer = get_new_flight_info(writer)
> > > >>         time.sleep(60)
> > > >>
> > > >>     if writer:
> > > >>         writer.close()
> > > >>
> > > >> This is working fine and the file grows every 60 seconds.
> > > >> However unless I force the loop to exit I am unable to use the parquet
> > > >> file. In a separate terminal I try to access the parquet file using
> > this
> > > >> code:
> > > >>
> > > >> import pandas as pd
> > > >> import pyarrow.parquet as pq
> > > >>
> > > >> table = pq.read_table("openskyflights.parquet")
> > > >> df = table.to_pandas()
> > > >> print(len(df))
> > > >>
> > > >> which results in this error:
> > > >>
> > > >> Traceback (most recent call last):
> > > >>  File "checkdownloadsize.py", line 7, in 
> > > >>table = pq.read_table("openskyflights.parquet")
> > > >>  File
> > >
> > "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packa

Re: Arrow pull requests: please limit squashing your commits

2018-12-19 Thread Antoine Pitrou


Le 19/12/2018 à 14:42, Wes McKinney a écrit :
> 
> * Our PR merge tool (dev/merge_arrow_py.py) squashes all the commits
> anyway, so squashing twice is redundant

The problem is that you can then get spurious conflicts if you base a PR on
another one.  It has happened to me several times.

Regards

Antoine.


Re: Arrow pull requests: please limit squashing your commits

2018-12-19 Thread Francois Saint-Jacques
No issue with this.

When the final squash is done, which title/body is preserved?

On Wed, Dec 19, 2018 at 8:43 AM Wes McKinney  wrote:

> hi folks,
>
> As the contributor base has grown, our development styles have grown
> increasingly diverse.
>
> Sometimes contributors are used to working in a Gerrit-style workflow
> where patches are always squashed with `git rebase -i` into a single
> patch, and then force pushed to the PR branch.
>
> I'd like to ask you to avoid doing this, as it can make things harder
> for maintainers. Let me explain:
>
> * When you rebase and force-push, GitHub fails to generate an e-mail
> notification. I use the GitHub notifications to tell which branches
> are being actively developed and may need to be reviewed again. Many
> times now I have thought a branch was inactive only to look more
> closely and see that it's been updated via force-push. Since it took
> GitHub 10 years to start showing force-push changes at all in their UI,
> I'm not holding out for them to send e-mail notifications about this.
>
> * GitHub is not Gerrit. We don't have the awesome incremental diff
> feature. So in lieu of this it's easier to be able to look at
> incremental diffs (e.g. responses to code review comments) by clicking
> on the individual commits
>
> * Our PR merge tool (dev/merge_arrow_py.py) squashes all the commits
> anyway, so squashing twice is redundant
>
> Sometimes I'll have commits like this in my branch
>
> * IMPLEMENTING THE FEATURE
> * lint
> * fixing CI
> * fixing toolchain issue
> * code review commits
> * fixing CI issues
> * more code review comments
> * documentation
>
> I think it's fine to combine some of the commits like this, so long as the
> produced commits reflect the logical evolution of your patch, for the
> purposes of code review.
>
> In the event of a gnarly rebase on master, sometimes it is easiest to
> create a single commit and then rebase that.
>
> Thanks,
> Wes
>


[jira] [Created] (ARROW-4077) [Gandiva] fix CI if ctest doesn't run any tests

2018-12-19 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-4077:
-

 Summary: [Gandiva] fix CI if ctest doesn't run any tests
 Key: ARROW-4077
 URL: https://issues.apache.org/jira/browse/ARROW-4077
 Project: Apache Arrow
  Issue Type: Bug
  Components: Gandiva
Reporter: Pindikura Ravindra


This has happened a couple of times already due to changes in
build/flags/labels, and it's hard to figure out unless we look at the Travis
output carefully.

Instead, travis_script_gandiva_cpp.sh should exit with a non-zero status if
no tests are run.
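
As a sketch only (the real change would live in the shell script, not in Python), the guard could look something like the following, assuming `ctest -N` keeps printing its usual "Total Tests: N" summary line:

{noformat}
import re
import subprocess
import sys

# Ask ctest to list the tests it would run, without actually running them.
listing = subprocess.check_output(['ctest', '-N']).decode()
match = re.search(r'Total Tests:\s*(\d+)', listing)
total = int(match.group(1)) if match else 0
if total == 0:
    sys.exit('No Gandiva tests were discovered; failing the CI step.')
{noformat}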



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Arrow pull requests: please limit squashing your commits

2018-12-19 Thread Wes McKinney
hi folks,

As the contributor base has grown, our development styles have grown
increasingly diverse.

Sometimes contributors are used to working in a Gerrit-style workflow
where patches are always squashed with `git rebase -i` into a single
patch, and then force pushed to the PR branch.

I'd like to ask you to avoid doing this, as it can make things harder
for maintainers. Let me explain:

* When you rebase and force-push, GitHub fails to generate an e-mail
notification. I use the GitHub notifications to tell which branches
are being actively developed and may need to be reviewed again. Many
times now I have thought a branch was inactive only to look more
closely and see that it's been updated via force-push. Since it took
GitHub 10 years to start showing force-push changes at all in their UI,
I'm not holding out for them to send e-mail notifications about this.

* GitHub is not Gerrit. We don't have the awesome incremental diff
feature. So in lieu of this it's easier to be able to look at
incremental diffs (e.g. responses to code review comments) by clicking
on the individual commits

* Our PR merge tool (dev/merge_arrow_py.py) squashes all the commits
anyway, so squashing twice is redundant

Sometimes I'll have commits like this in my branch

* IMPLEMENTING THE FEATURE
* lint
* fixing CI
* fixing toolchain issue
* code review commits
* fixing CI issues
* more code review comments
* documentation

I think it's fine to combine some of the commits like this, so long as the
produced commits reflect the logical evolution of your patch, for the
purposes of code review.

In the event of a gnarly rebase on master, sometimes it is easiest to
create a single commit and then rebase that.

Thanks,
Wes


Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Joel Pfaff
Hello,

For my company's use cases, we have found that the number of files was a
critical factor in the time spent building the execution plan, so we found
the idea of very regularly writing small Parquet files rather inefficient.

There are some formats that support `append` semantics (I have tested
successfully with Avro, but there are a couple of others that could be used
similarly).
So we had a few cases where we were aggregating data in a `current table`
as a set of Avro files, and rewriting all of it into a few Parquet files at
the end of the day.
This allowed us to produce files prepared for optimal query performance
(file size, row group size, per-column sorting), maximizing the benefit
from the statistics.
And our queries performed a UNION between the "optimized for speed" history
tables and the "optimized for latency" current tables when the query
timeframe crossed into the current day.
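
To make the pattern concrete, here is a rough sketch of the end-of-day rewrite, under a few assumptions of mine: the small current-table files are Avro, readable with fastavro, and small enough to fit in memory at once; the paths, glob pattern, and row-group size are only illustrative.

import glob

import fastavro
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def compact_current_table(avro_glob, parquet_path):
    # Gather every record written to the "current table" during the day.
    records = []
    for path in sorted(glob.glob(avro_glob)):
        with open(path, 'rb') as f:
            records.extend(fastavro.reader(f))
    df = pd.DataFrame(records)
    # Optionally sort here by the columns you filter on most, so that the
    # per-row-group statistics in the resulting Parquet file prune well.
    table = pa.Table.from_pandas(df)
    # One well-sized file with large row groups replaces many small files.
    pq.write_table(table, parquet_path, row_group_size=1000000)

compact_current_table('current/2018-12-19/*.avro',
                      'history/2018-12-19.parquet')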

Regards, Joel

On Wed, Dec 19, 2018 at 2:14 PM Francois Saint-Jacques <
fsaintjacq...@networkdump.com> wrote:

> Hello Darren,
>
> What Uwe suggests is usually the way to go: your active process writes to
> a new file every time. Then you have a parallel process/thread that
> compacts the smaller files in the background so that you don't end up with
> too many files.
>
> On Wed, Dec 19, 2018 at 7:59 AM Uwe L. Korn  wrote:
>
> > Hello Darren,
> >
> > you're out of luck here. Parquet files are immutable and meant for batch
> > writes. Once they're written you cannot modify them anymore. To load
> them,
> > you need to know their metadata which is in the footer. The footer is
> > always at the end of the file and written once you call close.
> >
> > Your use case is normally fulfilled by continuously starting new files and
> > reading them back in using the ParquetDataset class
> >
> > Cheers
> > Uwe
> >
> > Am 18.12.2018 um 21:03 schrieb Darren Gallagher :
> >
> > >> [Cross posted from https://github.com/apache/arrow/issues/3203]
> > >>
> > >> I'm adding new data to a parquet file every 60 seconds using this
> code:
> > >>
> > >> import os
> > >> import json
> > >> import time
> > >> import requests
> > >> import pandas as pd
> > >> import numpy as np
> > >> import pyarrow as pa
> > >> import pyarrow.parquet as pq
> > >>
> > >> api_url = 'https://opensky-network.org/api/states/all'
> > >>
> > >> cols = ['icao24', 'callsign', 'origin', 'time_position',
> > >>         'last_contact', 'longitude', 'latitude',
> > >>         'baro_altitude', 'on_ground', 'velocity', 'true_track',
> > >>         'vertical_rate', 'sensors', 'geo_altitude', 'squawk',
> > >>         'spi', 'position_source']
> > >>
> > >> def get_new_flight_info(writer):
> > >>     print("Requesting new data")
> > >>     req = requests.get(api_url)
> > >>     content = req.json()
> > >>
> > >>     states = content['states']
> > >>     df = pd.DataFrame(states, columns = cols)
> > >>     df['timestamp'] = content['time']
> > >>     print("Found {} new items".format(len(df)))
> > >>
> > >>     table = pa.Table.from_pandas(df)
> > >>     if writer is None:
> > >>         writer = pq.ParquetWriter('openskyflights.parquet', table.schema)
> > >>     writer.write_table(table=table)
> > >>     return writer
> > >>
> > >> if __name__ == '__main__':
> > >>     writer = None
> > >>     while (not os.path.exists('opensky.STOP')):
> > >>         writer = get_new_flight_info(writer)
> > >>         time.sleep(60)
> > >>
> > >>     if writer:
> > >>         writer.close()
> > >>
> > >> This is working fine and the file grows every 60 seconds.
> > >> However unless I force the loop to exit I am unable to use the parquet
> > >> file. In a separate terminal I try to access the parquet file using
> this
> > >> code:
> > >>
> > >> import pandas as pd
> > >> import pyarrow.parquet as pq
> > >>
> > >> table = pq.read_table("openskyflights.parquet")
> > >> df = table.to_pandas()
> > >> print(len(df))
> > >>
> > >> which results in this error:
> > >>
> > >> Traceback (most recent call last):
> > >>  File "checkdownloadsize.py", line 7, in 
> > >>table = pq.read_table("openskyflights.parquet")
> > >>  File
> >
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
> > line 1074, in read_table
> > >>use_pandas_metadata=use_pandas_metadata)
> > >>  File
> >
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/filesystem.py",
> > line 182, in read_parquet
> > >>filesystem=self)
> > >>  File
> >
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
> > line 882, in __init__
> > >>self.validate_schemas()
> > >>  File
> >
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
> > line 895, in validate_schemas
> > >>self.schema = self.pieces[0].get_metadata(open_file).schema
> > >>  File
> >
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/si

Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Francois Saint-Jacques
Hello Darren,

What Uwe suggests is usually the way to go: your active process writes to a
new file every time. Then you have a parallel process/thread that compacts
the smaller files in the background so that you don't end up with too many
files.
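
For illustration only, a sketch of what such a background compaction thread could look like; the directory layout, file-name pattern, threshold and interval are all my own assumptions, and a real implementation would need to coordinate deletions with readers.

import glob
import os
import threading
import time

import pyarrow.parquet as pq

def compact_forever(directory, interval=300, max_files=100):
    while True:
        parts = sorted(glob.glob(os.path.join(directory, 'part-*.parquet')))
        if len(parts) > max_files:
            # Read all the small files as one table, rewrite them as a
            # single larger file, then drop the originals.
            merged = pq.ParquetDataset(parts).read()
            out = os.path.join(directory,
                               'compacted-%d.parquet' % int(time.time()))
            pq.write_table(merged, out)
            for p in parts:
                os.remove(p)
        time.sleep(interval)

threading.Thread(target=compact_forever, args=('data',), daemon=True).start()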

On Wed, Dec 19, 2018 at 7:59 AM Uwe L. Korn  wrote:

> Hello Darren,
>
> you're out of luck here. Parquet files are immutable and meant for batch
> writes. Once they're written you cannot modify them anymore. To load them,
> you need to know their metadata which is in the footer. The footer is
> always at the end of the file and written once you call close.
>
> Your use case is normally fulfilled by continuously starting new files and
> reading them back in using the ParquetDataset class
>
> Cheers
> Uwe
>
> Am 18.12.2018 um 21:03 schrieb Darren Gallagher :
>
> >> [Cross posted from https://github.com/apache/arrow/issues/3203]
> >>
> >> I'm adding new data to a parquet file every 60 seconds using this code:
> >>
> >> import os
> >> import json
> >> import time
> >> import requests
> >> import pandas as pd
> >> import numpy as np
> >> import pyarrow as pa
> >> import pyarrow.parquet as pq
> >>
> >> api_url = 'https://opensky-network.org/api/states/all'
> >>
> >> cols = ['icao24', 'callsign', 'origin', 'time_position',
> >>         'last_contact', 'longitude', 'latitude',
> >>         'baro_altitude', 'on_ground', 'velocity', 'true_track',
> >>         'vertical_rate', 'sensors', 'geo_altitude', 'squawk',
> >>         'spi', 'position_source']
> >>
> >> def get_new_flight_info(writer):
> >>     print("Requesting new data")
> >>     req = requests.get(api_url)
> >>     content = req.json()
> >>
> >>     states = content['states']
> >>     df = pd.DataFrame(states, columns = cols)
> >>     df['timestamp'] = content['time']
> >>     print("Found {} new items".format(len(df)))
> >>
> >>     table = pa.Table.from_pandas(df)
> >>     if writer is None:
> >>         writer = pq.ParquetWriter('openskyflights.parquet', table.schema)
> >>     writer.write_table(table=table)
> >>     return writer
> >>
> >> if __name__ == '__main__':
> >>     writer = None
> >>     while (not os.path.exists('opensky.STOP')):
> >>         writer = get_new_flight_info(writer)
> >>         time.sleep(60)
> >>
> >>     if writer:
> >>         writer.close()
> >>
> >> This is working fine and the file grows every 60 seconds.
> >> However unless I force the loop to exit I am unable to use the parquet
> >> file. In a separate terminal I try to access the parquet file using this
> >> code:
> >>
> >> import pandas as pd
> >> import pyarrow.parquet as pq
> >>
> >> table = pq.read_table("openskyflights.parquet")
> >> df = table.to_pandas()
> >> print(len(df))
> >>
> >> which results in this error:
> >>
> >> Traceback (most recent call last):
> >>  File "checkdownloadsize.py", line 7, in 
> >>table = pq.read_table("openskyflights.parquet")
> >>  File
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
> line 1074, in read_table
> >>use_pandas_metadata=use_pandas_metadata)
> >>  File
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/filesystem.py",
> line 182, in read_parquet
> >>filesystem=self)
> >>  File
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
> line 882, in __init__
> >>self.validate_schemas()
> >>  File
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
> line 895, in validate_schemas
> >>self.schema = self.pieces[0].get_metadata(open_file).schema
> >>  File
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
> line 453, in get_metadata
> >>return self._open(open_file_func).metadata
> >>  File
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
> line 459, in _open
> >>reader = open_file_func(self.path)
> >>  File
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
> line 984, in open_file
> >>common_metadata=self.common_metadata)
> >>  File
> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
> line 102, in __init__
> >>self.reader.open(source, metadata=metadata)
> >>  File "pyarrow/_parquet.pyx", line 639, in
> pyarrow._parquet.ParquetReader.open
> >>  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> >> pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.
> >>
> >> Is there a way to achieve this?
> >> I'm assuming that if I call writer.close() in the while loop then it
> will
> >> prevent any further data being written to the file? Is there some kind
> of
> >> "flush" operation that can be used to ensure all data is written to disk
> >> and available to other processes or threads that want to read the data?
> >>
> >> Thanks
> >>
>
>


Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Uwe L. Korn
Hello Darren, 

you're out of luck here. Parquet files are immutable and meant for batch
writes. Once they're written, you cannot modify them anymore. To load them, you
need to know their metadata, which is stored in the footer. The footer is always
at the end of the file and is only written once you call close.

Your use case is normally fulfilled by continuously starting new files and
reading them back in using the ParquetDataset class.
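
A minimal sketch of that approach, adapted from the loop in your original mail; the directory name and file-naming scheme are my own choices, not part of your code.

import os
import time

import pyarrow as pa
import pyarrow.parquet as pq

def write_batch(df, directory='openskyflights'):
    # Each fetch becomes its own small file; the footer is complete as soon
    # as write_table() returns, so readers can pick the file up immediately.
    os.makedirs(directory, exist_ok=True)
    table = pa.Table.from_pandas(df)
    path = os.path.join(directory, 'flights-%d.parquet' % int(time.time()))
    pq.write_table(table, path)

# In a separate reader process, every closed file in the directory is visible:
table = pq.ParquetDataset('openskyflights').read()
df = table.to_pandas()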

Cheers
Uwe

Am 18.12.2018 um 21:03 schrieb Darren Gallagher :

>> [Cross posted from https://github.com/apache/arrow/issues/3203]
>> 
>> I'm adding new data to a parquet file every 60 seconds using this code:
>> 
>> import os
>> import json
>> import time
>> import requests
>> import pandas as pd
>> import numpy as np
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>> 
>> api_url = 'https://opensky-network.org/api/states/all'
>> 
>> cols = ['icao24', 'callsign', 'origin', 'time_position',
>>         'last_contact', 'longitude', 'latitude',
>>         'baro_altitude', 'on_ground', 'velocity', 'true_track',
>>         'vertical_rate', 'sensors', 'geo_altitude', 'squawk',
>>         'spi', 'position_source']
>> 
>> def get_new_flight_info(writer):
>>     print("Requesting new data")
>>     req = requests.get(api_url)
>>     content = req.json()
>> 
>>     states = content['states']
>>     df = pd.DataFrame(states, columns = cols)
>>     df['timestamp'] = content['time']
>>     print("Found {} new items".format(len(df)))
>> 
>>     table = pa.Table.from_pandas(df)
>>     if writer is None:
>>         writer = pq.ParquetWriter('openskyflights.parquet', table.schema)
>>     writer.write_table(table=table)
>>     return writer
>> 
>> if __name__ == '__main__':
>>     writer = None
>>     while (not os.path.exists('opensky.STOP')):
>>         writer = get_new_flight_info(writer)
>>         time.sleep(60)
>> 
>>     if writer:
>>         writer.close()
>> 
>> This is working fine and the file grows every 60 seconds.
>> However unless I force the loop to exit I am unable to use the parquet
>> file. In a separate terminal I try to access the parquet file using this
>> code:
>> 
>> import pandas as pd
>> import pyarrow.parquet as pq
>> 
>> table = pq.read_table("openskyflights.parquet")
>> df = table.to_pandas()
>> print(len(df))
>> 
>> which results in this error:
>> 
>> Traceback (most recent call last):
>>  File "checkdownloadsize.py", line 7, in 
>>table = pq.read_table("openskyflights.parquet")
>>  File 
>> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
>>  line 1074, in read_table
>>use_pandas_metadata=use_pandas_metadata)
>>  File 
>> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/filesystem.py",
>>  line 182, in read_parquet
>>filesystem=self)
>>  File 
>> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
>>  line 882, in __init__
>>self.validate_schemas()
>>  File 
>> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
>>  line 895, in validate_schemas
>>self.schema = self.pieces[0].get_metadata(open_file).schema
>>  File 
>> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
>>  line 453, in get_metadata
>>return self._open(open_file_func).metadata
>>  File 
>> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
>>  line 459, in _open
>>reader = open_file_func(self.path)
>>  File 
>> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
>>  line 984, in open_file
>>common_metadata=self.common_metadata)
>>  File 
>> "/home//.local/share/virtualenvs/opensky-WcPvsoLj/lib/python3.5/site-packages/pyarrow/parquet.py",
>>  line 102, in __init__
>>self.reader.open(source, metadata=metadata)
>>  File "pyarrow/_parquet.pyx", line 639, in 
>> pyarrow._parquet.ParquetReader.open
>>  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
>> pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.
>> 
>> Is there a way to achieve this?
>> I'm assuming that if I call writer.close() in the while loop then it will
>> prevent any further data being written to the file? Is there some kind of
>> "flush" operation that can be used to ensure all data is written to disk
>> and available to other processes or threads that want to read the data?
>> 
>> Thanks
>> 



Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-19 Thread Uwe L. Korn
+1, I would also like to see them in Sphinx.

Uwe 

> Am 19.12.2018 um 11:13 schrieb Antoine Pitrou :
> 
> 
> We should decide where we want to put developer docs.
> 
> I would favour putting them in the Sphinx docs, personally.
> 
> Regards
> 
> Antoine.
> 
> 
>> Le 19/12/2018 à 02:20, Wes McKinney a écrit :
>> Some projects have a REVIEWERS.md file
>> 
>> https://github.com/apache/parquet-mr/blob/master/parquet-common/REVIEWERS.md
>> 
>> We could do the same, or keep the file on the project wiki so it's
>> lighter-weight to change (no pull request required)
>> 
>> https://cwiki.apache.org/confluence/display/ARROW
>> 
>> +1 for adding labels to PRs in any case. We use the [COMPONENT] naming
>> in the title so people can set up e-mail filters (the GitHub labels
>> don't come through in their e-mail notification AFAICT)
>> 
>>> On Tue, Dec 18, 2018 at 1:10 AM Chao Sun  wrote:
>>> 
>>> +1 on adding labels for languages, review states, components, etc. This
>>> makes it much easier to filter PRs.
>>> 
>>> Chao
>>> 
>>> On Wed, Dec 12, 2018 at 11:54 AM Krisztián Szűcs 
>>> wrote:
>>> 
 Create a new one and set arrow-xxx as parent:
 [image: image.png]
 
> On Wed, Dec 12, 2018 at 7:46 PM Antoine Pitrou  wrote:
> 
> 
> Apparently it's possible to create GitHub teams inside the Apache
> organization ourselves.  I've just created a dummy one:
> https://github.com/orgs/apache/teams/arrow-xxx/members
> 
> However, I cannot create a child team inside of the arrow-committers
> team.  The button "Add a team" here is grayed out:
> https://github.com/orgs/apache/teams/arrow-committers/teams
> 
> Regards
> 
> Antoine.
> 
> 
>> Le 12/12/2018 à 19:40, Krisztián Szűcs a écrit :
>> I like the GitHub teams approach. Do We need to ask INFRA to create
> them?
>> 
>>> On Wed, Dec 12, 2018, 7:28 PM Sebastien Binet >> 
>>> On Wed, Dec 12, 2018 at 7:25 PM Antoine Pitrou 
> wrote:
>>> 
 
 Hi,
 
 Now that we have a lot of different implementations and a growing
> number
 of assorted topics, it becomes hard to know whether a PR or issue has
> a
 dedicated expert or would benefit from an outsider look.
 
 In Python we have what we call the "experts" list which is a per-topic
 (or per-library module) contributors who are generally interested in
> and
 competent on such topic (*).  So it's possible to cc such a person, or
 if no expert is available on a given topic, perhaps for someone else
> to
 try and have a look anyway.  Perhaps we need something similar for
> Arrow?
 
>>> 
>>> with github, one can also create "teams" and "@" them.
>>> we could perhaps create @arrow-py, @arrow-cxx, @arrow-go, ...
>>> this dilutes a bit responsibilities but also reduces a bit the net
> that's
>>> cast.
>>> 
>>> -s
>>> 
>>> 
 (*) https://devguide.python.org/experts/
 
 Regards
 
 Antoine.
 
 
 
> Le 12/12/2018 à 19:13, Ravindra Pindikura a écrit :
> Attendees : Wes, Sidd, Bryan, Francois, Hatem, Nick, Shyam, Ravindra,
 Matt
> 
> Wes:
> - do not rush the 0.12 release before the holidays, instead target
> the
 release for early next year
> - request everyone to look at PRs in the queue, and help by doing
>>> reviews
> 
> Wes/Nick
> - queried about Interest in developing a "dataset abstraction" as a
 layer above file readers that arrow now supports (parquet, csv, json)
> 
> Sidd
> - agreed to be the release manager for 0.12
> - things to keep in mind for release managers :
> 1. We now use crossbow to automate the building of binaries with CI
> 2. From this release, the binary artifacts will be hosted in bintray
 instead of apache dist since the size has increased significantly
> 
> Hatem
> - Asked about documentation regarding IDE for setup/debug of arrow
 libraries
> - Wes pointed out the developer wiki on confluence. Hatem offered to
 help with documentation.
> 
> Thanks and regards,
> Ravindra.
> 
>> On 2018/12/12 16:54:21, Wes McKinney  wrote:
>> All are welcome to join -- call notes will be posted after>
>> 
>> https://meet.google.com/vtm-teks-phx>
 
>>> 
>> 
> 
 



Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-19 Thread Antoine Pitrou


We should decide where we want to put developer docs.

I would favour putting them in the Sphinx docs, personally.

Regards

Antoine.


Le 19/12/2018 à 02:20, Wes McKinney a écrit :
> Some projects have a REVIEWERS.md file
> 
> https://github.com/apache/parquet-mr/blob/master/parquet-common/REVIEWERS.md
> 
> We could do the same, or keep the file on the project wiki so it's
> lighter-weight to change (no pull request required)
> 
> https://cwiki.apache.org/confluence/display/ARROW
> 
> +1 for adding labels to PRs in any case. We use the [COMPONENT] naming
> in the title so people can set up e-mail filters (the GitHub labels
> don't come through in their e-mail notification AFAICT)
> 
> On Tue, Dec 18, 2018 at 1:10 AM Chao Sun  wrote:
>>
>> +1 on adding labels for languages, review states, components, etc. This
>> makes it much easier to filter PRs.
>>
>> Chao
>>
>> On Wed, Dec 12, 2018 at 11:54 AM Krisztián Szűcs 
>> wrote:
>>
>>> Create a new one and set arrow-xxx as parent:
>>> [image: image.png]
>>>
>>> On Wed, Dec 12, 2018 at 7:46 PM Antoine Pitrou  wrote:
>>>

 Apparently it's possible to create GitHub teams inside the Apache
 organization ourselves.  I've just created a dummy one:
 https://github.com/orgs/apache/teams/arrow-xxx/members

 However, I cannot create a child team inside of the arrow-committers
 team.  The button "Add a team" here is grayed out:
 https://github.com/orgs/apache/teams/arrow-committers/teams

 Regards

 Antoine.


 Le 12/12/2018 à 19:40, Krisztián Szűcs a écrit :
> I like the GitHub teams approach. Do We need to ask INFRA to create
 them?
>
> On Wed, Dec 12, 2018, 7:28 PM Sebastien Binet 
>> On Wed, Dec 12, 2018 at 7:25 PM Antoine Pitrou 
 wrote:
>>
>>>
>>> Hi,
>>>
>>> Now that we have a lot of different implementations and a growing
 number
>>> of assorted topics, it becomes hard to know whether a PR or issue has
 a
>>> dedicated expert or would benefit from an outsider look.
>>>
>>> In Python we have what we call the "experts" list which is a per-topic
>>> (or per-library module) contributors who are generally interested in
 and
>>> competent on such topic (*).  So it's possible to cc such a person, or
>>> if no expert is available on a given topic, perhaps for someone else
 to
>>> try and have a look anyway.  Perhaps we need something similar for
 Arrow?
>>>
>>
>> with github, one can also create "teams" and "@" them.
>> we could perhaps create @arrow-py, @arrow-cxx, @arrow-go, ...
>> this dilutes a bit responsibilities but also reduces a bit the net
 that's
>> cast.
>>
>> -s
>>
>>
>>> (*) https://devguide.python.org/experts/
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>>
>>> Le 12/12/2018 à 19:13, Ravindra Pindikura a écrit :
 Attendees : Wes, Sidd, Bryan, Francois, Hatem, Nick, Shyam, Ravindra,
>>> Matt

 Wes:
 - do not rush the 0.12 release before the holidays, instead target
 the
>>> release for early next year
 - request everyone to look at PRs in the queue, and help by doing
>> reviews

 Wes/Nick
 - queried about Interest in developing a "dataset abstraction" as a
>>> layer above file readers that arrow now supports (parquet, csv, json)

 Sidd
 - agreed to be the release manager for 0.12
 - things to keep in mind for release managers :
  1. We now use crossbow to automate the building of binaries with CI
  2. From this release, the binary artifacts will be hosted in bintray
>>> instead of apache dist since the size has increased significantly

 Hatem
 - Asked about documentation regarding IDE for setup/debug of arrow
>>> libraries
 - Wes pointed out the developer wiki on confluence. Hatem offered to
>>> help with documentation.

 Thanks and regards,
 Ravindra.

 On 2018/12/12 16:54:21, Wes McKinney  wrote:
> All are welcome to join -- call notes will be posted after>
>
> https://meet.google.com/vtm-teks-phx>
>>>
>>
>

>>>


[jira] [Created] (ARROW-4076) [Python] schema validation and filters

2018-12-19 Thread George Sakkis (JIRA)
George Sakkis created ARROW-4076:


 Summary: [Python] schema validation and filters
 Key: ARROW-4076
 URL: https://issues.apache.org/jira/browse/ARROW-4076
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: George Sakkis


Currently [schema 
validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900]
 of {{ParquetDataset}} takes place before filtering. This may raise a
{{ValueError}} if the schema differs across some dataset pieces, even if those
pieces would subsequently be filtered out. I think validation should happen
after filtering to prevent such spurious errors:
{noformat}
--- a/pyarrow/parquet.py
+++ b/pyarrow/parquet.py
@@ -878,13 +878,13 @@
 if split_row_groups:
 raise NotImplementedError("split_row_groups not yet implemented")
 
-if validate_schema:
-self.validate_schemas()
-
 if filters is not None:
 filters = _check_filters(filters)
 self._filter(filters)
 
+if validate_schema:
+self.validate_schemas()
+
 def validate_schemas(self):
 open_file = self._get_open_file_func()
{noformat}
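
A hypothetical reproduction (directory, column and partition names are invented for the example, and the exact filter value types may vary by pyarrow version):
{noformat}
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Two hive-style partitions whose pieces have incompatible schemas.
os.makedirs('ds/day=1', exist_ok=True)
os.makedirs('ds/day=2', exist_ok=True)
pq.write_table(pa.Table.from_pandas(pd.DataFrame({'x': [1, 2]})),
               'ds/day=1/part.parquet')
pq.write_table(pa.Table.from_pandas(pd.DataFrame({'x': ['a', 'b']})),
               'ds/day=2/part.parquet')

# Today this raises ValueError during validate_schemas() even though the
# filter would discard the day=2 piece; with the patch above, filtering
# runs first and the dataset opens fine.
dataset = pq.ParquetDataset('ds', filters=[('day', '=', '1')])
{noformat}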



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)