Re: Updates to Apache Parquet Twitter account

2024-05-23 Thread Julien Le Dem
Hello,
This is correct. I have updated the website and bio.
Julien

On Mon, May 13, 2024 at 4:53 PM Vinoo Ganesh  wrote:

> We looked into this about a year ago and I think @Julien Le Dem
>  may be the person with access to the Parquet
> Twitter.
>
> 
>
>
> On Mon, May 13, 2024 at 4:44 PM Bryce Mecum  wrote:
>
>> Hi all,
>>
>> Andrew Lamb's recent ticket [1] made me take a look at the
>> @ApacheParquet [2] Twitter account and I noticed two things:
>>
>> 1. The associated URL is "parquet.io" which doesn't resolve and should
>> probably be changed to https://parquet.apache.org
>> 2. The Bio could be capitalized and possibly just copied in from
>> whatever gets merged in PARQUET-2470
>>
>> Do others agree and could anyone volunteer to make the change or find
>> someone who has access who could?
>>
>> [1] https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-2470
>> [2] https://twitter.com/ApacheParquet
>>
>


Re: Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-23 Thread Julien Le Dem
I would agree it's a bit of both. The metadata overhead (per data volume)
doesn't increase when you have fewer files.
That being said, you could use fewer of the metadata features in that use
case if the goal is to exchange well-formed data without ambiguity.
For a wide schema, it would be useful not to have to read metadata for
columns you are not reading.
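The wide-schema point can be illustrated with a toy sketch (this is not the actual Parquet footer layout, just the idea): if per-column metadata blocks are individually addressable through an offset index, a reader only deserializes the blocks for the columns it touches. The helper names below are invented for this sketch.

```python
import struct
from io import BytesIO

def write_toy_footer(column_meta: dict[str, bytes]) -> bytes:
    """Serialize per-column metadata blocks followed by an offset index,
    so a reader can seek straight to the columns it cares about."""
    buf = BytesIO()
    index = {}  # column name -> (offset, length)
    for name, meta in column_meta.items():
        index[name] = (buf.tell(), len(meta))
        buf.write(meta)
    index_start = buf.tell()
    for name, (offset, length) in index.items():
        encoded = name.encode()
        buf.write(struct.pack("<H", len(encoded)))
        buf.write(encoded)
        buf.write(struct.pack("<II", offset, length))
    buf.write(struct.pack("<I", index_start))  # trailer: where the index begins
    return buf.getvalue()

def read_columns(footer: bytes, wanted: set[str]) -> dict[str, bytes]:
    """Decode only the index, then slice out the requested columns'
    metadata -- the other columns' blocks are never parsed."""
    (index_start,) = struct.unpack_from("<I", footer, len(footer) - 4)
    pos, result = index_start, {}
    while pos < len(footer) - 4:
        (nlen,) = struct.unpack_from("<H", footer, pos); pos += 2
        name = footer[pos:pos + nlen].decode(); pos += nlen
        offset, length = struct.unpack_from("<II", footer, pos); pos += 8
        if name in wanted:
            result[name] = footer[offset:offset + length]
    return result
```

Today's Parquet footer instead keeps all column-chunk metadata inside one Thrift-encoded FileMetaData struct, which is why a reader must decode metadata even for unread columns; the proposals discussed in these threads explore making the footer sliceable in roughly this spirit.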

On Wed, May 22, 2024 at 9:26 AM Rok Mihevc  wrote:

> I have worked in small data science/engineering teams where time to do
> engineering is often a luxury and ad hoc data transformations and analysis
> are the norm. In such environments a format that requires a catalog for
> efficient reads will be less effective than one that comes with batteries
> and good defaults included.
>
> Aside: a nice view into ad hoc parquet workloads in the wild is the kaggle
> forums [1].
>
> [1] https://www.kaggle.com/search?q=parquet
>
> Rok
>
> On Wed, May 22, 2024 at 12:43 AM Micah Kornfield 
> wrote:
>
> > From my perspective I think the answer is more or less both.  Even with
> > only the data lake use-case we see a wide variety of files that people
> > would consider to be pushing reasonable boundaries.  To some extent
> > these might be solvable by having libraries provide better defaults (e.g.
> > only collecting/writing statistics by default for the first N columns).
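The "better defaults" idea above can be sketched as a writer-side policy. This is a toy illustration: `collect_statistics` and its signature are made up for the sketch, not an API of any Parquet library.

```python
def collect_statistics(rows, schema, max_stats_columns=2):
    """Toy writer pass: gather min/max statistics only for the first
    `max_stats_columns` columns, bounding footer size for wide schemas."""
    tracked = schema[:max_stats_columns]
    stats = {name: {"min": None, "max": None} for name in tracked}
    for row in rows:
        for name in tracked:
            value = row[name]
            s = stats[name]
            s["min"] = value if s["min"] is None else min(s["min"], value)
            s["max"] = value if s["max"] is None else max(s["max"], value)
    return stats
```

Some libraries already let a writer opt out per column — pyarrow's `parquet.write_table`, for instance, accepts a `write_statistics` argument that can be a list of column names; the suggestion here is about what the default behavior should be.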
> >
> >
> >
> > On Tue, May 21, 2024 at 12:56 PM Steve Loughran
> > 
> > wrote:
> >
> > > I wish people would use avro over CSV. Not just for the schema or more
> > > complex structures, but because the parser recognises corrupt files. Oh,
> > > and the well-defined serialization formats for things like "string" and
> > > "number".
> > >
> > > that said, I generate CSV in test/utility code because it is trivial to
> > > do and then feed straight into a spreadsheet; I'm not trying to use it
> > > for interchange.
> > >
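The corruption-detection point can be made concrete: a CSV reader happily returns rows from a truncated file, while a container format with magic bytes and length-prefixed blocks can notice that data is missing. A toy illustration — the framing below is invented for the sketch and is not Avro's actual block format, though `b"Obj\x01"` is the real Avro object container magic.

```python
import csv
import io
import struct

MAGIC = b"Obj\x01"  # real Avro container magic; the framing below is a toy

def naive_csv_rows(text: str) -> list[list[str]]:
    # A truncated CSV still "parses": the reader cannot tell data is missing.
    return list(csv.reader(io.StringIO(text)))

def read_framed(data: bytes) -> list[bytes]:
    # Magic + length-prefixed records: truncation or corruption raises.
    if data[:4] != MAGIC:
        raise ValueError("bad magic: not a container file")
    records, pos = [], 4
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated length prefix")
        (n,) = struct.unpack_from("<I", data, pos)
        pos += 4
        if pos + n > len(data):
            raise ValueError("truncated record")
        records.append(data[pos:pos + n])
        pos += n
    return records
```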
> > > On Sat, 18 May 2024 at 17:10, Curt Hagenlocher 
> > > wrote:
> > >
> > > > While CSV is still the undisputed monarch of exchanging data via
> > > > files, Parquet is arguably "top 3" -- and this is a scenario in which
> > > > the file does really need to be self-contained.
> > > >
> > > > On Sat, May 18, 2024 at 9:01 AM Raphael Taylor-Davies
> > > >  wrote:
> > > >
> > > > > Hi Fokko,
> > > > >
> > > > > I am aware of catalogs such as iceberg, my question was if in the
> > > design
> > > > > of parquet we can assume the existence of such a catalog.
> > > > >
> > > > > Kind Regards,
> > > > >
> > > > > Raphael
> > > > >
> > > > > On 18 May 2024 16:18:22 BST, Fokko Driesprong 
> > > wrote:
> > > > > >Hey Raphael,
> > > > > >
> > > > > >Thanks for reaching out here. Have you looked into table formats
> > > > > >such as Apache Iceberg? This seems to fix the problem that you're
> > > > > >describing.
> > > > > >
> > > > > >A table format adds an ACID layer to the file format and acts as a
> > > > > >fully functional database. In the case of Iceberg, a catalog is
> > > > > >required for atomicity, and alternatives like Delta Lake also seem
> > > > > >to trend in that direction:
> > > > > >https://github.com/orgs/delta-io/projects/10/views/1?pane=issue=57584023
> > > > > >
> > > > > >> I'm conscious that for many users this responsibility is instead
> > > > > >> delegated to a catalog that maintains its own index structures and
> > > > > >> statistics, only relies on the parquet metadata for very late
> > > > > >> stage pruning, and may therefore see limited benefit from
> > > > > >> revisiting the parquet metadata structures.
> > > > > >
> > > > > >
> > > > > >This is exactly what Iceberg offers; it provides additional
> > > > > >metadata to speed up the planning process:
> > > > > >https://iceberg.apache.org/docs/nightly/performance/
> > > > > >
> > > > > >Kind regards,
> > > > > >Fokko
> > > > > >
> > > > > >Op za 18 mei 2024 om 16:40 schreef Raphael Taylor-Davies
> > > > > >:
> > > > > >
> > > > > >> Hi All,
> > > > > >>
> > > > > >> The recent discussions about metadata make me wonder where a
> > > > > >> storage format ends and a database begins, as people seem to have
> > > > > >> differing expectations of parquet here. In particular, one school
> > > > > >> of thought posits that parquet should suffice as a standalone
> > > > > >> technology, where users can write parquet files to a store and
> > > > > >> efficiently query them directly with no additional technologies.
> > > > > >> However, others instead view parquet as a storage format for use
> > > > > >> in conjunction with some sort of catalog / metastore. These two
> > > > > >> approaches naturally place very different demands on the parquet
> > > > > >> format. The former case incentivizes constructing extremely large
> > > > > >> parquet files, potentially on the order of TBs [1], such
> > > > 

Re: [DISCUSS] Parquet 3 "wide schema" metadata draft

2024-05-23 Thread Julien Le Dem
Yes this is the essence of what I was getting to. Thank you! This makes it
easier to reconnect the discussion.
It is unfortunate that renaming the discussion creates another new thread
:P (we can't win :) )
I agree that we'll end up splitting this into independent discussions
on specific subsets of the docs that we'll label accordingly.




On Sat, May 18, 2024 at 4:30 AM Antoine Pitrou  wrote:

> On Fri, 17 May 2024 07:37:37 -0700
> Julien Le Dem  wrote:
> > This context should be added in the PR description itself.
>
> Good point, I've added context in the PR description. Let me know if
> that's sufficient.
>
> > From a design process perspective, it makes it more difficult to converge
> > the discussion and build consensus if we start multiple threads rather
> > than keeping the discussion on the original thread.
>
> A single discussion thread won't be able to drive forward all the
> potential changes that we're currently talking about (the Google doc is
> enumerating *a lot* of potential changes).
>
> However, I should have entitled this discussion appropriately.
> The original title is misleading: my PR is only concerned with the "wide
> schema" use case. Let me fix this here :-)
>
> Regards
>
> Antoine.
>
>
>


Re: [DISCUSS] Parquet 3 metadata draft / strawman proposal

2024-05-23 Thread Julien Le Dem
I just wanted to follow up and say thank you Antoine for updating the
description of your PR and bringing the discussion back to the doc. This is
helpful.
https://github.com/apache/parquet-format/pull/242

On Fri, May 17, 2024 at 10:37 AM Julien Le Dem  wrote:

> This context should be added in the PR description itself. My main point
> is to keep the discussion connected rather than starting new threads on
> the mailing list or PRs on github that don't refer to the original doc they
> are connected to.
>
> From a design process perspective, it makes it more difficult to converge the
> discussion and build consensus if we start multiple threads rather than
> keeping the discussion on the original thread.
>
> Goals are pretty concrete, but we have to write them down to make them
> clear.
> They are what motivates the change to the metadata. Discussing the changes
> in a PR without agreeing on why we're doing them is premature. Similarly
> before doing benchmarks we need to agree on what we are optimizing for.
>
>
>
> On Fri, May 17, 2024 at 1:48 AM Antoine Pitrou  wrote:
>
>>
>> Hi Julien,
>>
>> Yes, I posted comments on Micah's document, and I referenced this PR in
>> those discussions. Personally, I feel more comfortable when I have some
>> concrete proposal to comment on, rather than abstract goals, and I
>> figured other people might be like me. Discussing actual Thrift
>> metadata makes it clearer to me where the friction points might reside,
>> and what the opportunities might be.
>>
>> These changes might also later serve as an experimentation platform to
>> run crude benchmarks and try to validate what's really needed for the
>> wide-schema case to be handled efficiently.
>>
>> They are not intended to be submitted for inclusion anytime soon, and
>> I'm not planning to push for them if someone comes up with something
>> better and more thought out.
>>
>> All in all, this started as a personal investigation to understand
>> whether and how a "v3 schema" could be made backwards-compatible, and
>> when I saw that it seemed actually doable I decided it would be worth
>> posting the initial sketch instead of keeping it for myself.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On Thu, 16 May 2024 18:41:26 -0700
>> Julien Le Dem  wrote:
>> > Hi Antoine,
>> >
>> > On the other thread Micah is collecting feedback in a document.
>> > https://lists.apache.org/thread/61z98xgq2f76jxfjgn5xfq1jhxwm3jwf
>> >
>> > Would you mind putting your feedback there?
>> > We should collect the goals before jumping to solutions.
>> > It is a bit difficult to discuss those directly in the thrift metadata.
>> >
>> > Thank you
>> >
>> >
>> > On Thu, May 16, 2024 at 4:13 AM Antoine Pitrou <
>> antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
>> >
>> > >
>> > > Hello,
>> > >
>> > > In the light of recent discussions, I've put up a very rough proposal
>> > > of a Parquet 3 metadata format that allows both for light-weight
>> > > file-level metadata and backwards compatibility with legacy readers.
>> > >
>> > > For the sake of convenience and out of personal preference, I've made
>> > > this a PR to parquet-format rather than a Google Doc:
>> > > https://github.com/apache/parquet-format/pull/242
>> > >
> >> > > Feel free to point out any glaring mistakes or misunderstandings on my
>> part,
>> > > or to comment on details.
>> > >
>> > > Regards
>> > >
>> > > Antoine.
>> > >
>> > >
>> > >
>> >
>>
>>
>>
>>


Re: [DISCUSS] rename parquet-mr to parquet-java?

2024-05-20 Thread Julien Le Dem
Thank you Andrew!

On Mon, May 20, 2024 at 7:05 AM Andrew Lamb  wrote:

> Here is the infrastructure ticket with the request to rename the
> repository: https://issues.apache.org/jira/browse/INFRA-25802
>
> On Fri, May 17, 2024 at 1:28 PM Prem Sahoo  wrote:
>
> > +1 as it will be an apt name.
> > Sent from my iPhone
> >
> > > On May 17, 2024, at 12:32 PM, Daniel Weeks  wrote:
> > >
> > > +1 agree, much cleaner naming
> > >
> > > -Dan
> > >
> > >> On Fri, May 17, 2024 at 8:46 AM Chao Sun  wrote:
> > >>
> > >> +1 too. The name has been confusing for a very long time.
> > >>
> > >>> On Fri, May 17, 2024 at 8:40 AM Fokko Driesprong 
> > wrote:
> > >>>
> > >>> +1 - I think it is much clearer to anyone.
> > >>>
> > >>> GitHub will handle all the redirects from the old to the new name, so
> > >>> no reason from my end to not rename it :)
> > >>>
> > >>> Cheers, Fokko
> > >>>
> > >>> Op vr 17 mei 2024 om 17:30 schreef Julien Le Dem  >:
> > >>>
> > >>>> +1
> > >>>> I should have named it that to start with.
> > >>>>
> > >>>>
> > >>>> On Fri, May 17, 2024 at 3:27 AM Wang, Yuming
>  > >>>
> > >>>> wrote:
> > >>>>
> > >>>>> +10086
> > >>>>>
> > >>>>> From: Uwe L. Korn 
> > >>>>> Date: Thursday, May 16, 2024 at 15:41
> > >>>>> To: dev@parquet.apache.org 
> > >>>>> Subject: Re: [DISCUSS] rename parquet-mr to parquet-java?
> > >>>>> External Email
> > >>>>>
> > >>>>> very heavy +1
> > >>>>>
> > >>>>> This would help a lot.
> > >>>>>
> > >>>>> On Thu, May 16, 2024, at 4:19 AM, Gang Wu wrote:
> > >>>>>> +1 on renaming the repo to reduce confusion.
> > >>>>>>
> > >>>>>> However, the java library still uses the "parquet-mr" prefix to
> > >>>>>> write its application version [1] and it is consumed by downstream
> > >>>>>> projects like parquet-cpp [2] as well.
> > >>>>>>
> > >>>>>> [1]
> > >>>>>> https://github.com/search?q=repo%3Aapache%2Fparquet-mr+parquet-mr+language%3AJava&type=code&l=Java
> > >>>>>> [2]
> > >>>>>> https://github.com/search?q=repo%3Aapache%2Farrow+parquet-mr+language%3AC%2B%2B+&type=code
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Gang
> > >>>>>>
> > >>>>>> On Thu, May 16, 2024 at 12:47 AM Vinoo Ganesh <
> > >>> vinoo.gan...@gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>

Re: [C++] Parquet and Arrow overlap

2024-05-17 Thread Julien Le Dem
If we deem that it would be too hard to move it back for the moment, we
need at a minimum to clarify and reduce the confusion.
If practice doesn't match what the PMC voted on, we need to improve the
practice.
Do we have suggestions on improving that?
Perhaps an OWNERS file in the parquet folder of the arrow repo? (just an idea)

On Fri, May 17, 2024 at 2:49 AM Uwe L. Korn  wrote:

>
>
> On Fri, May 17, 2024, at 10:36 AM, Antoine Pitrou wrote:
> > Hi Julien,
> >
> > On Thu, 16 May 2024 18:23:33 -0700
> > Julien Le Dem  wrote:
> >>
> >> As discussed, that code was moved in the arrow repo for convenience:
> >> https://lists.apache.org/thread/gkvbm6yyly1r4cg3f6xtnqkjz6ogn6o2
> >>
> >> To take an excerpt of that original decision:
> >>
> >> 4) The Parquet and Arrow C++ communities will collaborate to provide
> >> development workflows to enable contributors working exclusively on the
> >> Parquet core functionality to be able to work unencumbered with
> unnecessary
> >> build or test dependencies from the rest of the Arrow codebase. Note
> that
> >> parquet-cpp already builds a significant portion of Apache Arrow en
> route
> >> to creating its libraries 5) The Parquet community can create scripts to
> >> "cut" Parquet C++ releases by packaging up the appropriate components
> and
> >> ensuring that they can be built and installed independently as now
> >
> > Unfortunately, these two points haven't happened at all. On the
> > contrary, the Arrow C++ dependency has infused much deeper in Parquet
> > C++ (I was not there at the beginning of Parquet C++, but I get the
> > impression there was originally an effort to have a Arrow-independent
> > Parquet C++ core; that "core" doesn't exist anymore).
>
> As an example, we had in the beginning separate I/O primitives in Arrow
> and Parquet. But during the further development, we realised that we were
> implementing exactly the same code paths only in different namespaces.
>
> There are some core "utilities" hidden in Arrow that are required to build
> any modern C++-based data processing library. Separating them into their
> own repository would enable parquet-cpp to be split out more easily. But
> given that development in this area is still very active in Arrow, it
> would bring a massive slowdown to the overall project.
>
> Best
> Uwe
>


Re: [DISCUSS] rename parquet-mr to parquet-java?

2024-05-17 Thread Julien Le Dem
+1
I should have named it that to start with.


On Fri, May 17, 2024 at 3:27 AM Wang, Yuming 
wrote:

> +10086
>
> From: Uwe L. Korn 
> Date: Thursday, May 16, 2024 at 15:41
> To: dev@parquet.apache.org 
> Subject: Re: [DISCUSS] rename parquet-mr to parquet-java?
> External Email
>
> very heavy +1
>
> This would help a lot.
>
> On Thu, May 16, 2024, at 4:19 AM, Gang Wu wrote:
> > +1 on renaming the repo to reduce confusion.
> >
> > However, the java library still uses the "parquet-mr" prefix to write its
> > application version [1] and it is consumed by downstream projects like
> > parquet-cpp [2] as well.
> >
> > [1]
> >
> > https://github.com/search?q=repo%3Aapache%2Fparquet-mr+parquet-mr+language%3AJava&type=code&l=Java
> >
> > [2]
> >
> > https://github.com/search?q=repo%3Aapache%2Farrow+parquet-mr+language%3AC%2B%2B+&type=code
> >
> >
> > Best,
> > Gang
> >
> > On Thu, May 16, 2024 at 12:47 AM Vinoo Ganesh 
> > wrote:
> >
> >> +1, I think this will make things a lot clearer! (non-binding)
> >>
> >> 
> >>
> >>
> >> On Wed, May 15, 2024 at 12:36 PM Jacques Nadeau 
> >> wrote:
> >>
> >> > +1000
> >> >
> >> > On Wed, May 15, 2024 at 6:30 AM Andrew Lamb 
> >> > wrote:
> >> >
> >> > > Julien had a great suggestion[1] to  rename the parquet-mr
> repository
> >> to
> >> > > parquet-java to reduce confusion about its content.
> >> > >
> >> > >  > This looks great. Thank you for taking the initiative. Hadoop is
> not
> >> > > required indeed. Perhaps at some point we should rename parquet-mr
> to
> >> > > parquet-java?
> >> > >
> >> > > Having just renamed
> >> > > https://github.com/apache/arrow-datafusion to
> >> > > https://github.com/apache/datafusion I think this would be a
> >> relatively
> >> > > painless experience as all existing links still work
> >> > >
> >> > > I filed a ticket here
> >>
> >> https://issues.apache.org/jira/browse/PARQUET-2475
> >> > >
> >> > > Thoughts?
> >> > > Andrew
> >> > >
> >> > > [1]
> >> > >
> >> >
> >>
> >> https://github.com/apache/parquet-site/pull/59#pullrequestreview-2056038304
> >> > >
> >> >
> >>
>


Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-17 Thread Julien Le Dem
It's not just whether it's readable or not.
It is also whether the format allows reaching the performance
characteristics expected.
*A* reference implementation should be developed at the same time as the
format change to confirm that we reach the stated goals.
This is needed whether we consider it *the* reference implementation or
just *a* reference implementation for this particular change.


On Fri, May 17, 2024 at 2:51 AM Steve Loughran 
wrote:

> I'd argue that compatibility across implementations is "can they correctly
> read the data generated by the others?", so this is less about an RI than
> about compliance testing, the way closed source stuff often works.
>
> Specification
>
>1. Files generated by the implementation which are believed to match the
>specification
>2. Assertions about the contents of these files (this is
>something which needs to be declared in a way that can be used by test
>    runners of the different implementations, so tricky.)
>3. Tests which validate those assertions on the parsed contents
>
>
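A minimal sketch of how such declared assertions might look — the manifest shape, file name, and `check` helper below are hypothetical, not an existing parquet-testing format:

```python
import json

# Hypothetical interop manifest: a reference file declares language-neutral
# assertions that each implementation's test runner evaluates against its
# own parsed output. File name and fields are illustrative only.
MANIFEST = json.loads("""
{
  "file": "reference/wide_schema.parquet",
  "assertions": [
    {"column": "id", "num_rows": 8, "min": 0, "max": 7},
    {"column": "bool_col", "null_count": 0}
  ]
}
""")

def check(parsed_columns, manifest):
    """Run the declared assertions against one implementation's parsed
    output; an empty failure list means the implementation is compliant."""
    failures = []
    for assertion in manifest["assertions"]:
        col = parsed_columns[assertion["column"]]
        for key, expected in assertion.items():
            if key == "column":
                continue
            if col.get(key) != expected:
                failures.append(
                    f"{assertion['column']}.{key}: {col.get(key)!r} != {expected!r}"
                )
    return failures
```

Each implementation would only need to map its own parsed representation into the shared column/property dictionary; the assertion files themselves stay language-neutral.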
> I've never done anything like this before. Maybe anyone who has tried to
> implement an SQL standard has some suggestions. Indeed, SQL might be the
> language for those assertions, which would then have to go through
> spark/hive/impala/etc for validation. Which is ultimately what you want,
> just a lot harder to build, test, debug and identify what is broken.
>
> On Fri, 17 May 2024 at 09:40, Antoine Pitrou  wrote:
>
> >
> > +1 (non-binding :-)) on the idea of having a shortlist of "accredited"
> > implementations.
> >
> > I would suggest to add a third implementation such as parquet-rs, since
> > its authors are active here; especially as the Parquet Java and C++
> > teams seem to have some overlap historically, and a third
> > implementation helps bring different perspectives.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Thu, 16 May 2024 17:37:35 -0700
> > Julien Le Dem  wrote:
> > > I would support it as long as we maintain a list of the implementations
> > > that we consider "accredited" to be reference implementations (we
> > > being a PMC vote here).
> > > Not all implementations are created equal from an adoption point of
> > > view.
> > > Originally the Impala implementation was the second implementation for
> > > interop. Later on, the parquet-cpp implementation was added as a
> > standalone
> > > implementation in the Parquet project. This is the implementation that
> > > lives in the arrow repository.
> > > The parquet java implementation and the parquet cpp implementation in
> > > the arrow repo are at the top of that list IMO.
> > >
> > >
> > > On Thu, May 16, 2024 at 6:17 AM Rok Mihevc <
> > rok.mihevc-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > >
> > > > I would support a "two interoperable open source implementations"
> > > > requirement.
> > > >
> > > > Rok
> > > >
> > > > On Thu, May 16, 2024 at 10:06 AM Antoine Pitrou 
> > > > wrote:
> > > >
> > > > >
> > > > > I'm in (non-binding) agreement with Ed here. I would just add that
> > > > > the requirement for two interoperable implementations should mandate
> > > > > that these are open source implementations.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > On Tue, 14 May 2024 14:48:09 -0700
> > > > > Ed Seidl  wrote:
> > > > > > Given the breadth of the parquet community at this point, I don't
> > > > > > think we should be singling out one or two "reference"
> > > > > > implementations. Even parquet-mr, AFAIK, still doesn't implement
> > > > > > DELTA_LENGTH_BYTE_ARRAY encoding in a user-accessible way (it's
> > > > > > only available as part of the DELTA_BYTE_ARRAY writer). There are
> > > > > > many situations in which the former would be the superior choice,
> > > > > > and in fact the specification documentation still lists DLBA as
> > > > > > "always preferred over PLAIN for byte array columns" [1].
> > > > > > Similarly, DELTA_BYTE_ARRAY encoding was only added to parquet-cpp
> > > > > > in the last year [2], and column indexes a few months before that
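For readers unfamiliar with the encoding Ed mentions: DELTA_LENGTH_BYTE_ARRAY stores all value lengths up front followed by the concatenated value bytes. A simplified sketch of that layout — the real encoding delta-bit-packs the lengths, whereas plain little-endian ints are used here for clarity:

```python
import struct

def encode_delta_length_byte_array(values):
    """Sketch of the DELTA_LENGTH_BYTE_ARRAY layout: value count, then all
    lengths, then the concatenated value bytes. (The real encoding
    delta-bit-packs the lengths; plain little-endian ints are used here.)"""
    header = struct.pack(f"<{len(values) + 1}I", len(values), *(len(v) for v in values))
    return header + b"".join(values)

def decode_delta_length_byte_array(data):
    """Read the lengths block, then slice each value out of the data block."""
    (count,) = struct.unpack_from("<I", data, 0)
    lengths = struct.unpack_from(f"<{count}I", data, 4)
    pos, values = 4 + 4 * count, []
    for length in lengths:
        values.append(data[pos:pos + length])
        pos += length
    return values
```

Because the lengths are contiguous, a reader can compute every value's offset without scanning the data bytes — which is the property behind the spec's "always preferred over PLAIN for byte array columns" recommendation quoted above.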

Re: [DISCUSS] Parquet 3 metadata draft / strawman proposal

2024-05-17 Thread Julien Le Dem
This context should be added in the PR description itself. My main point is
to keep the discussion connected rather than starting new threads on
the mailing list or PRs on github that don't refer to the original doc they
are connected to.

From a design process perspective, it makes it more difficult to converge the
discussion and build consensus if we start multiple threads rather than
keeping the discussion on the original thread.

Goals are pretty concrete, but we have to write them down to make them
clear.
They are what motivates the change to the metadata. Discussing the changes
in a PR without agreeing on why we're doing them is premature. Similarly
before doing benchmarks we need to agree on what we are optimizing for.



On Fri, May 17, 2024 at 1:48 AM Antoine Pitrou  wrote:

>
> Hi Julien,
>
> Yes, I posted comments on Micah's document, and I referenced this PR in
> those discussions. Personally, I feel more comfortable when I have some
> concrete proposal to comment on, rather than abstract goals, and I
> figured other people might be like me. Discussing actual Thrift
> metadata makes it clearer to me where the friction points might reside,
> and what the opportunities might be.
>
> These changes might also later serve as an experimentation platform to
> run crude benchmarks and try to validate what's really needed for the
> wide-schema case to be handled efficiently.
>
> They are not intended to be submitted for inclusion anytime soon, and
> I'm not planning to push for them if someone comes up with something
> better and more thought out.
>
> All in all, this started as a personal investigation to understand
> whether and how a "v3 schema" could be made backwards-compatible, and
> when I saw that it seemed actually doable I decided it would be worth
> posting the initial sketch instead of keeping it for myself.
>
> Regards
>
> Antoine.
>
>
> On Thu, 16 May 2024 18:41:26 -0700
> Julien Le Dem  wrote:
> > Hi Antoine,
> >
> > On the other thread Micah is collecting feedback in a document.
> > https://lists.apache.org/thread/61z98xgq2f76jxfjgn5xfq1jhxwm3jwf
> >
> > Would you mind putting your feedback there?
> > We should collect the goals before jumping to solutions.
> > It is a bit difficult to discuss those directly in the thrift metadata.
> >
> > Thank you
> >
> >
> > On Thu, May 16, 2024 at 4:13 AM Antoine Pitrou <
> antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> >
> > >
> > > Hello,
> > >
> > > In the light of recent discussions, I've put up a very rough proposal
> > > of a Parquet 3 metadata format that allows both for light-weight
> > > file-level metadata and backwards compatibility with legacy readers.
> > >
> > > For the sake of convenience and out of personal preference, I've made
> > > this a PR to parquet-format rather than a Google Doc:
> > > https://github.com/apache/parquet-format/pull/242
> > >
> > > Feel free to point out any glaring mistakes or misunderstandings on my
> part,
> > > or to comment on details.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> >
>
>
>
>


Re: [DISCUSS] Parquet 3 metadata draft / strawman proposal

2024-05-16 Thread Julien Le Dem
Hi Antoine,

On the other thread Micah is collecting feedback in a document.
https://lists.apache.org/thread/61z98xgq2f76jxfjgn5xfq1jhxwm3jwf

Would you mind putting your feedback there?
We should collect the goals before jumping to solutions.
It is a bit difficult to discuss those directly in the thrift metadata.

Thank you


On Thu, May 16, 2024 at 4:13 AM Antoine Pitrou  wrote:

>
> Hello,
>
> In the light of recent discussions, I've put up a very rough proposal
> of a Parquet 3 metadata format that allows both for light-weight
> file-level metadata and backwards compatibility with legacy readers.
>
> For the sake of convenience and out of personal preference, I've made
> this a PR to parquet-format rather than a Google Doc:
> https://github.com/apache/parquet-format/pull/242
>
> Feel free to point out any glaring mistakes or misunderstandings on my part,
> or to comment on details.
>
> Regards
>
> Antoine.
>
>
>


Re: [C++] Parquet and Arrow overlap

2024-05-16 Thread Julien Le Dem
> Hmm... I'm not sure I understand your point here. The Parquet spec and
> the Java implementation are already living in distinct repos and have
> distinct versioning schemes. The main thing that they share in common is
> the JIRA instance (while the C++ Parquet implementation mostly relies on
> Arrow's GH issue tracker), but is that really important?

It is not a problem that they are in separate repos. The problem is the
friction created because it makes access control difficult and creates
confusion on governance.
This thread "[DISCUSS] Parquet C++ under which PMC?" is a clear example of
it: https://lists.apache.org/thread/128wv5cwv51scm8vdfn1g9gskw717qyt

All I'm suggesting is that if the inconvenience this creates around unclear
governance discussions is greater than the convenience of being in the same
repo, we should revisit that decision.
It would be less inconvenient today to have parquet-cpp in its own repo
than it was at the time.

As discussed, that code was moved in the arrow repo for convenience:
https://lists.apache.org/thread/gkvbm6yyly1r4cg3f6xtnqkjz6ogn6o2

To take an excerpt of that original decision:

4) The Parquet and Arrow C++ communities will collaborate to provide
development workflows to enable contributors working exclusively on the
Parquet core functionality to be able to work unencumbered with unnecessary
build or test dependencies from the rest of the Arrow codebase. Note that
parquet-cpp already builds a significant portion of Apache Arrow en route
to creating its libraries 5) The Parquet community can create scripts to
"cut" Parquet C++ releases by packaging up the appropriate components and
ensuring that they can be built and installed independently as now

The alternative is to live up to the part where we agreed that the two
communities collaborate on making it easy for the Parquet community to
govern its code base in the arrow repo.
Would you agree?


On Thu, May 16, 2024 at 1:00 AM Micah Kornfield 
wrote:

> From my perspective I agree, that I don't think there is benefit of moving
> parquet C++ out of arrow given what it would actually cost to make clean
> boundaries.  I also don't think it will hurt iteration speed.
>
> I think the main challenge could be in compatibility testing, but Arrow has
> solved this between implementations that live in different repositories so
> I think the same solutions could apply for Parquet.
>
> On Thu, May 16, 2024 at 12:57 AM Antoine Pitrou 
> wrote:
>
> > On Tue, 14 May 2024 10:22:37 -0700
> > Julien Le Dem  wrote:
> > > 1. I think we should make it easy for people contributing to the C++
> > > codebase. (which is why I voted for the move at the time)
> > > 2. If merging repos removes the need to deal with the circular
> dependency
> > > between repos issue for the C++ code bases, it does it at the expense
> of
> > > making it easy to evolve the parquet spec and the java and c++
> > > implementations together.
> >
> > Hmm... I'm not sure I understand your point here. The Parquet spec and
> > the Java implementation are already living in distinct repos and have
> > distinct versioning schemes. The main thing that they share in common is
> > the JIRA instance (while the C++ Parquet implementation mostly relies on
> > Arrow's GH issue tracker), but is that really important?
> >
> > > parquet-cpp depends only on arrow-core that does not have to depend on
> > > parquet-cpp.
> >
> > That is true.
> >
> > > Other components like
> > > arrow-dataset and pyarrow can depend on parquet-cpp just like they
> depend
> > > on orc externally.
> >
> > Ideally yes. In practice there are two problems:
> > 1) it creates a circular dependency between *repositories*.
> > 2) the C++ Arrow Datasets component is not built independently, it is an
> > optional component when building Arrow C++. So we would also have a
> > chicken-and-egg problem when building Arrow C++ and Parquet C++.
> >
> > > I realize that would be work to make it happen, but the current
> location
> > of
> > > the parquet-cpp codebase is a big trade-off of prioritizing quick
> > iteration
> > > on the C++ implementations over iteration on the format.
> >
> > Having recently worked on a format addition and its respective
> > implementations (in Java and C++), I haven't found the current setup
> > more difficult to work with for Parquet C++ than it was for Parquet
> > Java. Admittedly I'm biased, being a heavy contributor to Arrow C++,
> > but I'm curious why the current situation would be detrimental to
> > iteration on the format.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>


Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-16 Thread Julien Le Dem
I would support it as long as we maintain a list of the implementations
that we consider "accredited" to be reference implementations (we being a
PMC vote here).
Not all implementations are created equal from an adoption point of view.
Originally the Impala implementation was the second implementation for
interop. Later on, the parquet-cpp implementation was added as a standalone
implementation in the Parquet project. This is the implementation that
lives in the arrow repository.
The parquet java implementation and the parquet cpp implementation in the
arrow repo are at the top of that list IMO.


On Thu, May 16, 2024 at 6:17 AM Rok Mihevc  wrote:

> I would support a "two interoperable open source implementations"
> requirement.
>
> Rok
>
> On Thu, May 16, 2024 at 10:06 AM Antoine Pitrou 
> wrote:
>
> >
> > I'm in (non-binding) agreement with Ed here. I would just add that the
> > requirement for two interoperable implementations should mandate that
> > these are open source implementations.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Tue, 14 May 2024 14:48:09 -0700
> > Ed Seidl  wrote:
> > > Given the breadth of the parquet community at this point, I don't think
> > > we should be singling out one or two "reference" implementations. Even
> > > parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY
> > > encoding in a user-accessible way (it's only available as part of the
> > > DELTA_BYTE_ARRAY writer). There are many situations in which the
> > > former would be the superior choice, and in fact the specification
> > > documentation still lists DLBA as "always preferred over PLAIN for byte
> > > array columns" [1]. Similarly, DELTA_BYTE_ARRAY encoding was only added
> > > to parquet-cpp in the last year [2], and column indexes a few months
> > > before that [3].
> > >
> > > Instead, I think we should leave out any mention of a reference
> > > implementation,
> > > and continue to require two, independent, interoperable implementations
> > > before adopting a change to the spec. This, IMO, would go a long way
> > towards
> > > increasing excitement for Parquet outside the parquet-mr/arrow world.
> > >
> > > Just my (non-binding) two cents.
> > >
> > > Cheers,
> > > Ed
> > >
> > > [1]
> > >
> >
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> > > [2] https://github.com/apache/arrow/pull/14341
> > > [3] https://github.com/apache/arrow/pull/34054
> > >
> > > On 5/14/24 9:44 AM, Julien Le Dem wrote:
> > > > I agree that parquet-mr implementation is a requirement to evolve the
> > spec.
> > > > It makes sense to me that we call parquet-mr the reference
> > implementation
> > > > and make it a requirement to evolve the spec.
> > > > I would add the requirement to implement it in the parquet cpp
> > > > implementation that lives in apache Arrow:
> > > > https://github.com/apache/arrow/tree/main/cpp/src/parquet
> > > > This code used to live in the parquet-cpp repo in the Parquet
> project.
> > > > Being language agnostic is an important feature of the format.
> > > > Interoperability tests should also be included.
> > > >
> > > > On Tue, May 14, 2024 at 9:31 AM Antoine Pitrou <
> > antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> > > >
> > > >> AFAIK, the only Parquet implementation under the Apache Parquet
> > project
> > > >> is parquet-mr :-)
> > > >>
> > > >>
> > > >> On Tue, 14 May 2024 10:58:58 +0200
> > > >> Rok Mihevc  wrote:
> > > >>> Second Raphael's point.
> > > >>> Would it be reasonable to say specification change requires
> > > >> implementation
> > > >>> in two parquet implementations within Apache Parquet project?
> > > >>>
> > > >>> Rok
> > > >>>
> > > >>> On Tue, May 14, 2024 at 10:50 AM Gang Wu <
> > > >> ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > >>>> IMHO, it looks more reasonable if a reference implementation is
> > > >> required
> > > >>>> to support most (not all) elements from the specification.
> > > >>>>
> > > >>>> Another question is: should we discuss 

Re: [DISCUSS] Parquet C++ under which PMC?

2024-05-16 Thread Julien Le Dem
Here is the thread we voted on at the time:
https://lists.apache.org/thread/gkvbm6yyly1r4cg3f6xtnqkjz6ogn6o2
and the thread calling the result:
https://lists.apache.org/thread/76wzx2lsbwjl363bg066g8kdsocd03rw

This thread calls for giving Parquet committers access to this part of the
repo so they can contribute to this code base, and asks for good
collaboration between Parquet and Arrow committers. There was and still is a lot of
overlap between the parquet and arrow committers.
The access mechanisms are tied to repos, so this does not make it easy.

At the time the dependency management in the C++ repos (Parquet and Arrow)
and the changing APIs made things difficult, which prompted moving those
two in the same repo.
Now that those APIs are more stable I do think splitting the repos would be
easier.
Arrow is not a monorepo anymore like it was at the time.

That would clarify things from an access control perspective.


On Thu, May 16, 2024 at 6:41 AM Andrew Lamb  wrote:

> > . Warranted or not, there is still a perception among some that parquet
> is closely tied to the Spark / Hadoop ecosystems,
>
> It certainly doesn't help that https://parquet.apache.org explicitly says
> it is for Hadoop: "Apache Parquet is a columnar storage format available to
> any project in the Hadoop ecosystem, regardless of the choice of data
> processing framework, data model or programming language." right on the
> front page.
>
> Shameless plug for a committer to merge my PR[1] to the site that makes it
> clearer parquet is more general.
>
> Andrew
>
> [1]: https://github.com/apache/parquet-site/pull/59
>
> On Thu, May 16, 2024 at 9:37 AM Raphael Taylor-Davies
>  wrote:
>
> > I can't speak for other's motivations, but for me it is about better
> > communicating parquet as a format specification, with a number of
> > implementations in different languages, as opposed to a specific Java
> > implementation. Perhaps something closer to the approach of arrow, where
> > there is a family of first-party implementations, across a number of
> > different languages, that all work together to ensure interoperability,
> > evolve the specification, etc... Warranted or not, there is still a
> > perception among some that parquet is closely tied to the Spark / Hadoop
> > ecosystems, and only useful as a means of interoperating with said
> > ecosystems.
> >
> > On 16/05/2024 14:11, Rok Mihevc wrote:
> > > What are the benefits of a parquet implementation being part of Apache
> > > Parquet vs another Apache project vs something else entirely?
> > > Being part of Apache org? Branding? Voting rights?
> > > If motivations are clear, solutions might be more readily apparent.
> > >
> > > Rok
> > >
> > > On Thu, May 16, 2024 at 2:36 PM Raphael Taylor-Davies
> > >  wrote:
> > >
> > >> I'm curious where the other arrow parquet implementations fit into
> this,
> > >> if at all? For context, the original Rust implementation was largely
> the
> > >> work of Chao Sun, who I believe to be a parquet PMC member, but it was
> > >> then donated to the arrow project, and has primarily been developed
> and
> > >> maintained by individuals affiliated with the arrow project since
> then,
> > >> myself included. I'm not suggesting all parquet implementations
> > >> necessarily need to be governed by the parquet PMC, but perhaps what
> > >> ever compromise we devise for parquet-cpp might equally be applied to
> > >> the other parquet projects that fall under the arrow umbrella?
> > >>
> > >> Kind Regards,
> > >>
> > >> Raphael
> > >>
> > >> On 16/05/2024 13:26, Uwe L. Korn wrote:
> > >>> I would actually consider someone who contributes to both communities
> > at
> > >> the same time to be a worthwhile addition to both projects. In my
> active
> > >> years, we have mostly voted people into both projects; the order was
> not
> > >> clear, though.
> > >>> Being a committer/PMC means that you want to bring the community
> around
> > >> a project forward in the Apache way (with parquet-cpp this is given as
> > it
> > >> is part of the parquet community and also still in a project that is
> > >> residing within the Apache org).
> >  he told me that the contribution to
> >  parquet-cpp is no longer considered when promoting committers to
> >  Apache Parquet PMC.
> > >>> As a Parquet PMC, I would strongly object to that and would be
> > >> supportive of also making them a Parquet committer/PMC.
> > >>> Best
> > >>> Uwe
> > >>>
> > >>> On Thu, May 16, 2024, at 2:19 PM, Gang Wu wrote:
> >  Hi,
> > 
> >  I share the same feeling with Antoine that parquet-cpp seems to be
> > fully
> >  governed by Apache Arrow PMC, not the Apache Parquet PMC. I have
> >  once discussed this with Xinli and he told me that the contribution
> to
> >  parquet-cpp is no longer considered when promoting committers to
> >  Apache Parquet PMC.
> > 
> >  Best,
> >  Gang
> > 
> >  On Thu, May 16, 2024 at 4:29 PM Antoine Pitrou 
> > >> wrote:
> 

Re: Interest in Parquet V3

2024-05-15 Thread Julien Le Dem
s true. I have seen this stance on this mailing list and
> in the Parquet community a lot in the past years, and even if there might
> be a speck of truth to it, it is again a defeatist stance that in the end
> hurts Parquet. This might be true for v2, for the problems mentioned above.
> But any new feature that displays tangible improvements will be adopted
> rather quickly by implementations. My company would implement new encodings
> that promise more compression while not making decoding slower with high
> priority. And so would other data lake vendors. With this, the chicken egg
> problem mentioned above would be resolved: The more vendors use new
> encodings in their lakes, the more pressure to support these is put onto
> all implementations.
>
> One valid argument against v3 that was already brought up repeatedly is
> that if we completely need to gut Parquet and replace many aspects of it to
> reach the goals of v3, then the resulting format just isn't Parquet
> anymore. So maybe we just need to move on one day to a format that is
> completely different, but until then, I would love to see improvements in
> Parquet. The good thing about making improvements in Parquet instead of
> switching to a totally different format is that we can mix and match and
> still retain the countless optimizations we have implemented for Parquet
> over the years.
>
> So, what are the shortcomings that should be fixed? A lot of good points
> have already been mentioned. As yet another data point, I want to depict
> the points that we struggle with, ordered by severity:
>
>
>- Missing random access. Parquet isn't made for random access, and while
>this is okay for most queries that just scan the whole file, there are
> many
>scenarios where this is a problem. Queries can filter out many rows and
> if
>the format then still requires doing a lot of work, this is a problem.
>Also, things like secondary indexes are hard if you do not have random
>access. For example, extracting a single row with a known row index
> from a
>Parquet file requires an insane amount of work. In contrast, in our own
>format [1], we have made sure that all encodings we use allow for O(1)
>random access. This means that we cannot use some nice encodings (e.g.
>RLE), but therefore we can access any value with just a few assembly
>instructions. The good thing about Parquet is that it gives choices to
> the
>user. Not all encodings need to allow fast random access, but there
> should
>be some for all data types, so that users that require fast random
> access
>can use these. Here are the top missing pieces IMHO:
>   - PLAIN encoding for strings doesn't allow random access, as it
>   interleaves string lengths with string data. This is just
> unnecessary, as
>   it is simple to have an encoding that does not have this without any
> real
>   drawbacks (e.g., see how Arrow does it with an offset array and
> separate
>   string data). We should propose such a new string PLAIN encoding and
>   deprecate the current one. Not only does the current one not allow
> random
>   access, it is also slow to decode as due to the interleaved lengths,
>   reading values has a data dependency on the length before, so the CPU
>   cannot out-of-order execute a scan.
>   - Metadata decoding is all-or-nothing, as already discussed. This
>   exacerbates the random I/O problem.
>   - To randomly access a column with NULL values, we first need prefix
>   sums over the D-Levels to know which encoded value is the one
> we're looking
>   for. There should be a way to encode a column with NULLs in ways
> where NULL
>   values are represented explicitly in the data. This increases memory
>   consumption, but allows fast random access. It's a trade-off,
> but one that
>   we would like to have in Parquet.
>- A lot of new encodings have been proposed lately having good
>compression while allowing fast, vectorized, decompression. Many of them
>also allow random access. It is hard to find a good list of encodings to
>add, so we gain most benefits while not bloating the amount of
> encodings,
>which would put undue implementation burden on each implementation.
>- As discussed, a simple feature bitmap instead of a version would be
>amazing, as it would allow us to quickly do a feature check with a
> binary
>    OR to see if our engine has all necessary features to read a Parquet
> file.
>I agree that having a compatibility matrix in a prominent spot is an
>important thing to have.
>
> Thanks in advance to anyone willing to drive this! I'm happy to give more
> input and collect furth
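The string-encoding point above can be made concrete with a small sketch. This is illustrative Python, not code from any Parquet implementation: the first layout mimics PLAIN BYTE_ARRAY (a 4-byte little-endian length before each value, so lookup is sequential), the second mimics an Arrow-style offsets-plus-data layout that allows O(1) random access.

```python
# Sketch: length-prefixed PLAIN layout vs. an offsets + data layout.
import struct

values = [b"foo", b"quux", b"x"]

# Current PLAIN byte-array layout: 4-byte little-endian length before each value.
plain = b"".join(struct.pack("<I", len(v)) + v for v in values)

def plain_get(buf, i):
    # Must walk every preceding length: O(i), with a data dependency
    # on each previous length (hurts out-of-order execution too).
    pos = 0
    for _ in range(i):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4 + n
    (n,) = struct.unpack_from("<I", buf, pos)
    return buf[pos + 4 : pos + 4 + n]

# Offset-based layout: offsets[i]..offsets[i+1] delimit value i, so any
# value can be sliced out in O(1).
offsets = [0]
for v in values:
    offsets.append(offsets[-1] + len(v))
data = b"".join(values)

def offset_get(offsets, data, i):
    return data[offsets[i] : offsets[i + 1]]

assert plain_get(plain, 2) == offset_get(offsets, data, 2) == b"x"
```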
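Likewise, the D-level point can be sketched. With NULLs stripped from the encoded values, finding the value for row i requires counting the max-level definition levels before it; a precomputed prefix sum turns that O(i) scan into an O(1) lookup. The level values and helper below are illustrative and assume a max definition level of 1.

```python
# Sketch: mapping a row index to an index into the nulls-stripped values.
def_levels = [1, 0, 1, 1, 0, 1]  # 1 = value present, 0 = NULL (max level 1)

# prefix[i] = number of non-null values among rows 0..i-1
prefix = [0]
for d in def_levels:
    prefix.append(prefix[-1] + int(d == 1))

def value_index(row):
    """Index into the encoded (nulls-stripped) values, or None for a NULL."""
    if def_levels[row] != 1:
        return None
    return prefix[row]

assert value_index(3) == 2   # rows 0 and 2 hold the first two values
assert value_index(4) is None
```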
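And the feature-bitmap idea mentioned above could look like the following sketch; the feature names and bit positions are invented for illustration and are not part of any Parquet spec.

```python
# Hypothetical required-feature bitmap: a reader masks the file's required
# bits against its own supported set and fails fast on any unknown bit.
FEAT_DELTA_BYTE_ARRAY = 1 << 0
FEAT_COLUMN_INDEXES   = 1 << 1
FEAT_BLOOM_FILTERS    = 1 << 2

READER_SUPPORTED = FEAT_DELTA_BYTE_ARRAY | FEAT_COLUMN_INDEXES

def can_read(file_required_features: int) -> bool:
    # Any bit set in the file but not supported by the reader means
    # "cannot read this file correctly".
    return (file_required_features & ~READER_SUPPORTED) == 0

assert can_read(FEAT_DELTA_BYTE_ARRAY | FEAT_COLUMN_INDEXES)
assert not can_read(FEAT_BLOOM_FILTERS)
```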

Re: Repeated fields spec clarification

2024-05-15 Thread Julien Le Dem
+1 The semantics of a row group are that it contains rows and therefore
it starts at R=0.
I generally echo Ed's sentiment here.

On Wed, May 15, 2024 at 8:01 AM Andrew Lamb  wrote:

> Thank you all -- I have filed
> https://issues.apache.org/jira/browse/PARQUET-2473 to track clarifying the
> spec and will make a PR shortly
>
>
> On Sun, May 12, 2024 at 12:18 AM wish maple 
> wrote:
>
> > IMO when Page V2 is present or PageIndex is enabled, the boundaries
> > should be checked [1]
> >
> > [1]
> >
> >
> https://github.com/apache/arrow/blob/d10ebf055a393c94a693097db1dca08ff86745bd/cpp/src/parquet/column_writer.cc#L1235-L1237
> >
> >
> > Jan Finis  于2024年5月11日周六 01:15写道:
> >
> > > Hey Parquet devs,
> > >
> > > I so far thought that Parquet mandates that records start at page
> > > boundaries, i.e., at r-level 0, and we have relied on this fact in some
> > > places of our engine. That means, there cannot be any data page for a
> > > REPEATED column that starts at an r-level > 0, as this would mean that
> a
> > > record would be split between multiple pages.
> > >
> > > I also found the two comments in parquet.thrift:
> > >
> > >   /** Number of rows in this data page. which means pages change on
> > record
> > > > boundaries (r = 0) **/
> > > >   3: required i32 num_rows
> > >
> > >
> > >   /**
> > > >* Index within the RowGroup of the first row of the page; this
> means
> > > > pages
> > > >* change on record boundaries (r = 0).
> > > >*/
> > > >   3: required i64 first_row_index
> > >
> > >
> > > These comments seem to imply that my understanding is correct. However,
> > > they are worded very weakly, not like a mandate but more like a "by the
> > > way" comment.
> > >
> > > I haven't found any other mention of r-levels and page boundaries in
> the
> > > parquet-format repo (maybe I missed them?).
> > >
> > > I recently noticed that pyarrow.parquet splits repeated fields over
> > > multiple pages, so it violates this. This triggers assertions in our
> > > engine, so I want to understand what's the right course of action here.
> > >
> > > So, can we please clarify:
> > > *Does Parquet mandate that pages need to start at r-level 0?*
> > >
> > >- I.e., is a parquet file with a page that starts at an r-level > 0
> > ill
> > >formed? I.e., is this a bug in pyarrow.parquet?
> > >- Or can pages start at r-level 0? If so, then what is the
> > significance
> > >of the comments in parquet.thrift?
> > >
> > >
> > > Cheers,
> > > Jan
> > >
> >
>
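A sketch of the invariant being clarified in this thread, that every data page of a REPEATED column starts at repetition level 0 (a record boundary), written as a writer- or validator-side check. `Page` here is a stand-in type, not a class from parquet-mr or parquet-cpp.

```python
# Sketch: validate that no page splits a record across a page boundary.
from dataclasses import dataclass

@dataclass
class Page:
    rep_levels: list  # repetition levels of the values in this page

def check_pages_start_on_record_boundary(pages):
    for i, page in enumerate(pages):
        # A first r-level > 0 would mean this page continues a record
        # started in the previous page.
        if page.rep_levels and page.rep_levels[0] != 0:
            raise ValueError(
                f"page {i} starts at r-level {page.rep_levels[0]}; "
                "records must not be split across pages"
            )

# A well-formed split: the second page starts a new record (r = 0).
check_pages_start_on_record_boundary([Page([0, 1, 1]), Page([0, 1])])
```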


Re: [DISCUSS] Propose changing the default branch of the parquet-site repo

2024-05-15 Thread Julien Le Dem
+1

On Wed, May 15, 2024 at 4:15 AM Andrew Lamb  wrote:

> I plan to wait until next week to allow any one else who has an opinion to
> share it here and then assuming no objections will file a ticket with ASF
> Infra.
>
> Andrew
>
> On Sun, May 12, 2024 at 3:57 AM Uwe L. Korn  wrote:
>
> > +1
> >
> > On Sun, May 12, 2024, at 9:31 AM, Gang Wu wrote:
> > > +1
> > >
> > > This makes sense. I was also confused when I had access to
> > > parquet-site for the first time.
> > >
> > > Thanks Andrew!
> > >
> > > Best,
> > > Gang
> > >
> > > On Sun, May 12, 2024 at 3:15 AM Vinoo Ganesh 
> > wrote:
> > >
> > >> +1, this would be great. It's something Xinli and I discussed when we
> > first
> > >> made the website updates, but it ended up falling off of the list. It
> > would
> > >> be great to have this updated.
> > >>
> > >> 
> > >>
> > >>
> > >> On Sat, May 11, 2024 at 8:52 PM Andrew Lamb 
> > >> wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > I would like to propose changing the default branch of the
> > parquet-site
> > >> > repo from `asf-site` to `production`
> > >> >
> > >> > The `asf-site` branch hosts the static files of the site (aka what
> is
> > >> built
> > >> > from the source in the `development` branch). Thus since it is the
> > >> default
> > >> > branch that is what appears when people open the parquet-site[1]
> repo
> > >> >
> > >> > I made a PR to update the readme in the asf-site branch[2] but I
> > think it
> > >> > would be better if we changed the default branch to production. This
> > >> > requires an INFRA JIRA ticket[2], which I am happy to file, but
> > wanted to
> > >> > discuss here first.
> > >> >
> > >> > Andrew Lamb
> > >> > (Apache DataFusion/Arrow PMC, ASF member)
> > >> >
> > >> > p.s.  my not-so-secret agenda is to improve the adoption of the
> > parquet
> > >> > file format by helping with communication and coordination. The
> > >> > parquet.apache.org website plays a key role in this, and thus I
> want
> > to
> > >> > help lower the barrier to help maintain (and update) it.
> > >> >
> > >> >
> > >> > [1]: https://github.com/apache/parquet-site
> > >> > [2]: https://github.com/apache/parquet-site/pull/57
> > >> > [2]:
> > >> >
> > >> >
> > >>
> >
> https://github.com/apache/infrastructure-asfyaml?tab=readme-ov-file#default_branch
> > >> >
> > >>
> >
>


Re: [C++] Parquet and Arrow overlap

2024-05-14 Thread Julien Le Dem
1. I think we should make it easy for people contributing to the C++
codebase. (which is why I voted for the move at the time)
2. If merging repos removes the need to deal with the circular-dependency
issue between repos for the C++ code bases, it does so at the expense of
the ease of evolving the parquet spec and the java and c++
implementations together.
This setup was optimized for quick iterations on the APIs on the C++ side.
Now that those APIs are more stable, it is less needed IMO.

parquet-cpp depends only on arrow-core that does not have to depend on
parquet-cpp. It really just needs the vectors. Other components like
arrow-dataset and pyarrow can depend on parquet-cpp just like they depend
on orc externally.

I realize that would be work to make it happen, but the current location of
the parquet-cpp codebase is a big trade-off of prioritizing quick iteration
on the C++ implementations over iteration on the format. As interest grows
in evolving the format, I think it warrants a re-evaluation.



On Tue, May 14, 2024 at 9:20 AM Antoine Pitrou  wrote:

>
> Moving Parquet C++ out of Arrow C++ would basically recreate the
> problems that motivated the integration of Parquet C++ into Arrow C++
> :-)
>
> Regards
>
> Antoine.
>
>
> On Tue, 14 May 2024 13:52:15 +0800
> Gang Wu  wrote:
> > IMO, moving parquet-cpp out of arrow is challenging as the dependency
> > chain looks like: arrow core <- parquet-cpp <- arrow dataset <- pyarrow
> >
> > Best,
> > Gang
> >
> > On Tue, May 14, 2024 at 12:38 PM Julien Le Dem <
> julien-1odqgaof3lkdnm+yrof...@public.gmane.org> wrote:
> >
> > > It is great to see more momentum building.
> > > I have myself a little bit more time to contribute to Parquet.
> > >
> > > Personally I think moving it back would make sense.
> > > *However* I have personally contributed a lot more to the Java than
> the C++
> > > code base.
> > > That move was done initially because people contributing to the Arrow
> and
> > > Parquet C++ code bases were the same ones and circular dependencies
> were
> > > getting in the way (does Parquet depend on Arrow or the other way
> around?
> > > At the time it was both ways.). So to make this happen, we need enough
> > > Parquet C++ contributors that would be happy with the move and clarify
> > > which way the dependency goes. My take is that Parquet depends on
> Arrow but
> > > I'd be curious to see what others think.
> > > Julien
> > >
> > > On Sat, May 11, 2024 at 2:51 AM Andrew Lamb 
> > > wrote:
> > >
> > > > It is great to see some additional enthusiasm and momentum around the
> > > > Apache Parquet implementation (congratulations on the release of
> > > parquet-mr
> > > > 1.14[1]!).
> > > >
> > > > As activity picks up, if the desire is to build more community around
> > > > Parquet, perhaps the Parquet PMC wants to encourage moving code back
> to
> > > > repositories managed by parquet (and out of arrow, for example). I
> > > realize
> > > > this would be a technical burden, but it might clarify communities
> and
> > > > committers.
> > > >
> > > > Andrew
> > > >
> > > > [1]:
> https://lists.apache.org/thread/2gggm938z0x9fx3wtwctfm5htsxlf3z4
> > > >
> > > >
> > > >
> > > > On Fri, May 10, 2024 at 11:45 PM Matt Topol 
> > > > wrote:
> > > >
> > > > > I just wanted to also poke the question of non-Java developers who
> have
> > > > > worked on the other parquet implementations potentially being
> > > recognized
> > > > as
> > > > > committers or otherwise on the Parquet project (speaking as the
> primary
> > > > > developer of the Go parquet implementation which also lives in
> the
> > > Arrow
> > > > > repository). It would be great to see some active contributors to
> > > > > parquet-cpp, parquet-go, or otherwise not just being considered but
> > > > > actively becoming committers.
> > > > >
> > > > > That's just my two cents from a community perspective.
> > > > >
> > > > > --Matt
> > > > >
> > > > > On Fri, May 10, 2024, 10:35 PM Jacob Wujciak <
> assignu...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Thank you, that sounds great! On first glance some seem to be
> rather
> > > > old

Re: Interest in Parquet V3

2024-05-14 Thread Julien Le Dem
+1 on Micah starting a doc and following up by commenting in it.

@Raphael, Wish Maple: agreed that changing the metadata representation is
less important. Most engines can externalize and index metadata in some
way; one option is to propose a standard way to do that without changing
the format. Adding new encodings, or making existing encodings more
parallelizable, needs to be in the format and is more useful.

On Tue, May 14, 2024 at 9:26 AM Antoine Pitrou  wrote:

> On Mon, 13 May 2024 16:10:24 +0100
> Raphael Taylor-Davies
> 
> wrote:
> >
> > I guess I wonder if rather than having a parquet format version 2, or
> > even a parquet format version 3, we could just document what features a
> > given parquet implementation actually supports. I believe Andrew intends
> > to pick up on where previous efforts here left off.
>
> I also believe documenting implementation status is strongly desirable,
> regardless of whether the discussion on "V3" goes anywhere.
>
> Regards
>
> Antoine.
>
>
>


Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Julien Le Dem
I agree that the parquet-mr implementation is a requirement to evolve the
spec. It makes sense to me that we call parquet-mr the reference
implementation and make it a requirement to evolve the spec.
I would add the requirement to implement it in the parquet cpp
implementation that lives in apache Arrow:
https://github.com/apache/arrow/tree/main/cpp/src/parquet
This code used to live in the parquet-cpp repo in the Parquet project.
Being language agnostic is an important feature of the format.
Interoperability tests should also be included.

On Tue, May 14, 2024 at 9:31 AM Antoine Pitrou  wrote:

>
> AFAIK, the only Parquet implementation under the Apache Parquet project
> is parquet-mr :-)
>
>
> On Tue, 14 May 2024 10:58:58 +0200
> Rok Mihevc  wrote:
> > Second Raphael's point.
> > Would it be reasonable to say specification change requires
> implementation
> > in two parquet implementations within Apache Parquet project?
> >
> > Rok
> >
> > On Tue, May 14, 2024 at 10:50 AM Gang Wu <
> ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >
> > > IMHO, it looks more reasonable if a reference implementation is
> required
> > > to support most (not all) elements from the specification.
> > >
> > > Another question is: should we discuss (and vote for) each candidate
> > > one by one? We can start with parquet-mr which is most well-known
> > > implementation.
> > >
> > > Best,
> > > Gang
> > >
> > > On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies
> > >  wrote:
> > >
> > > > Potentially it would be helpful to flip the question around. As
> Andrew
> > > > articulates, a reference implementation is required to implement all
> > > > elements from the specification, and therefore the major consequence
> of
> > > > labeling parquet-mr thusly would be that any specification change
> would
> > > > have to be implemented within parquet-mr as part of the
> standardisation
> > > > process. It would be insufficient for it to be implemented in, for
> > > > example, two of the parquet implementations maintained by the arrow
> > > > project. I personally think that would be a shame and likely exclude
> > > > many people who would otherwise be interested in evolving the parquet
> > > > specification, but think that is at the core of this question.
> > > >
> > > > Kind Regards,
> > > >
> > > > Raphael
> > > >
> > > > On 13/05/2024 20:55, Andrew Lamb wrote:
> > > > > Question: Should we label parquet-mr or any other parquet
> > > implementations
> > > > > "reference" implications"?
> > > > >
> > > > > This came up as part of Vinoo's great PR to list different parquet
> > > > > reference implementations[1][2].
> > > > >
> > > > > The term "reference implementation" often has an official
> connotation.
> > > > For
> > > > > example the wikipedia definition is "a program that implements all
> > > > > requirements from a corresponding specification. The reference
> > > > > implementation ... should be considered the "correct" behavior of
> any
> > > > other
> > > > > implementation of it."[3]
> > > > >
> > > > > Given the close association of parquet-mr to the parquet standard,
> it
> > > is
> > > > a
> > > > > natural candidate to label as "reference implementation." However,
> it
> > > is
> > > > > not clear to me if there is consensus that it should be thusly
> labeled.
> > > > >
> > > > > I have a strong opinion that a consensus on this question would be
> very
> > > > > helpful. I don't actually have a strong opinion about the answer
> > > > >
> > > > > Andrew
> > > > >
> > > > >
> > > > >
> > > > > [1]:
> > > >
> https://github.com/apache/parquet-site/pull/53#discussion_r1582882267
> > > > > [2]:
> > > >
> https://github.com/apache/parquet-site/pull/53#discussion_r1598283465
> > > > > [3]:  https://en.wikipedia.org/wiki/Reference_implementation
> > > > >
> > > >
> > >
> >
>
>
>
>


Re: Fwd: [C++] Parquet and Arrow overlap

2024-05-13 Thread Julien Le Dem
It is great to see more momentum building.
I have myself a little bit more time to contribute to Parquet.

Personally I think moving it back would make sense.
*However* I have personally contributed a lot more to the Java than the C++
code base.
That move was done initially because people contributing to the Arrow and
Parquet C++ code bases were the same ones and circular dependencies were
getting in the way (does Parquet depend on Arrow or the other way around?
At the time it was both ways.). So to make this happen, we need enough
Parquet C++ contributors that would be happy with the move and clarify
which way the dependency goes. My take is that Parquet depends on Arrow but
I'd be curious to see what others think.
Julien

On Sat, May 11, 2024 at 2:51 AM Andrew Lamb  wrote:

> It is great to see some additional enthusiasm and momentum around the
> Apache Parquet implementation (congratulations on the release of parquet-mr
> 1.14[1]!).
>
> As activity picks up, if the desire is to build more community around
> Parquet, perhaps the Parquet PMC wants to encourage moving code back to
> repositories managed by parquet (and out of arrow, for example). I realize
> this would be a technical burden, but it might clarify communities and
> committers.
>
> Andrew
>
> [1]: https://lists.apache.org/thread/2gggm938z0x9fx3wtwctfm5htsxlf3z4
>
>
>
> On Fri, May 10, 2024 at 11:45 PM Matt Topol 
> wrote:
>
> > I just wanted to also poke the question of non-Java developers who have
> > worked on the other parquet implementations potentially being recognized
> as
> > committers or otherwise on the Parquet project (speaking as the primary
> > developer of the Go parquet implementation which also lives in the Arrow
> > repository). It would be great to see some active contributors to
> > parquet-cpp, parquet-go, or otherwise not just being considered but
> > actively becoming committers.
> >
> > That's just my two cents from a community perspective.
> >
> > --Matt
> >
> > On Fri, May 10, 2024, 10:35 PM Jacob Wujciak 
> > wrote:
> >
> > > Thank you, that sounds great! On first glance some seem to be rather
> old
> > > and probably don't apply anymore.
> > >
> > > > BTW, do we really need to make a full copy of them to have a mirror
> in
> > > the Arrow GitHub issues?
> > >
> > > I am not sure I understand what you mean? A full copy of the
> > > open/closed/all issues? I'd say only the (remaining) open issues would
> be
> > > fine.
> > > For reference this is what an imported issue looks like:
> > > https://github.com/apache/arrow/issues/30543
> > >
> > > Am Sa., 11. Mai 2024 um 04:09 Uhr schrieb Gang Wu :
> > >
> > > > I can initiate the vote. But before the vote, I think we need to
> > revisit
> > > > the states of all unresolved tickets and close some as needed.
> > > >
> > > > BTW, do we really need to make a full copy of them to have a mirror
> > > > in the Arrow GitHub issues?
> > > >
> > > > I'd like to seek a consensus here before sending the vote.
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Sat, May 11, 2024 at 8:46 AM Jacob Wujciak  >
> > > > wrote:
> > > >
> > > > > Hello Everyone!
> > > > >
> > > > > It seems there is general agreement on this topic, it would be
> great
> > > if a
> > > > > committer/PMC could start a (lazy consensus) procedural vote.
> > > > >
> > > > > I will inquire how to handle the parquet-cpp component in jira
> > (ideally
> > > > > disabling it, not removing).
> > > > > There are currently only ~70 open tickets for parquet-cpp, with the
> > > > change
> > > > > in repo it is probably easier to just move open tickets but I'll
> > leave
> > > > that
> > > > > to Rok who managed the transition of Arrow's 20k+ tickets too :D
> > > > >
> > > > > Thanks,
> > > > > Jacob
> > > > >
> > > > > Arrow committer
> > > > >
> > > > > On 2024/04/25 05:31:18 Gang Wu wrote:
> > > > > > I know we have some non-Java committers and PMCs. But after the
> > > > > > parquet-cpp donation, it seems that no one who works on Parquet
> > > > > > from Arrow (C++, Rust, Go, etc.) or other projects has been
> > > > > > promoted to Parquet committer. It would be inconvenient for
> > > > > > non-Java Parquet developers to work with apache/parquet-format
> > > > > > and apache/parquet-testing repositories. Furthermore, votes from
> > > > > > these developers are not binding for a format change in the ML.
> > > > > >
> > > > > > Best,
> > > > > > Gang
> > > > > >
> > > > > > On Wed, Apr 24, 2024 at 8:42 PM Uwe L. Korn 
> > > wrote:
> > > > > >
> > > > > > > > Should we consider
> > > > > > > > Parquet developers from other projects than parquet-mr as
> > Parquet
> > > > > > > committers?
> > > > > > >
> > > > > > > We are doing this (speaking as a Parquet PMC who didn't work on
> > > > > > > parquet-mr, but parquet-cpp).
> > > > > > >
> > > > > > > Best
> > > > > > > Uwe
> > > > > > >
> > > > > > > On Wed, Apr 24, 2024, at 2:38 PM, Gang Wu wrote:
> > > > > > > > +1 

Re: Interest in Parquet V3

2024-05-13 Thread Julien Le Dem
It's great to see this thread. Thank you Micah for facilitating
the discussion.

my 2cts:
1. I like the idea of having feature checks rather than an absolute version
number. I am sorry for the confusion created by the V2 moniker. Those were
indeed incremental and backwards compatible additions to the v1 spec and
not a rewrite of the format.

a. It would be great to have a formal release cadence but someone needs to
dedicate time to drive the process.
b. IMO we need an implementer of a query engine to "sponsor" adding a new
feature to the format. They would implement usage at the same time so it
can be validated that additions to the spec achieve the expected perf
improvement in the context of a query engine. For example, some years ago,
Impala was implementing usage of new indexes at the same time they were
specified.
Tracking what engines and versions support the new feature would be useful.
Enough adoption would make it default. This requirement is very different
for a new encoding vs a new additional index or stat.
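To make the feature-check idea above concrete, here is a tiny sketch of how a reader could gate on named features instead of comparing a single version number. The feature names and the mechanism are entirely hypothetical; nothing like this exists in the Parquet spec today.

```python
# Hypothetical feature-check model: a writer records the set of features
# a file uses, and a reader compares that set against what it supports,
# instead of comparing an opaque version number. All names are made up.
SUPPORTED = {"rle_dictionary", "column_index", "delta_binary_packed"}

def can_read(file_features):
    """Return (ok, missing): ok is True iff every feature the file
    requires is supported by this reader; missing lists the gaps."""
    missing = set(file_features) - SUPPORTED
    return (len(missing) == 0, sorted(missing))

ok, missing = can_read({"rle_dictionary", "byte_stream_split_v2"})
# ok is False; missing names the one feature this reader lacks.
```

With this model, "enough adoption makes it default" becomes a per-feature decision rather than a format-wide version bump.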

2. I also think "encoding plugins" are not aligned with the philosophy of
Parquet, as the strength of the format is that it is fully specified across
languages and not just the output of a library.
I do think new encodings and a new metadata representation would be
welcome. FlatBuffers did not exist when I picked Thrift for the footer. The
current metadata representation is a pain to read partially or efficiently.
That said, big changes like this need a clear path for adoption and a plan
for the transition period. The file does have a magic number "PAR1" at the
beginning and the end that might be used for such incompatible changes at
the metadata layer.

I do think it is easier to integrate more encodings in the ecosystem (say
btrblocks) by adding them to Parquet than by creating a new file format
that would need to build adoption from scratch.

3. Agreed, it is an effort and requires collaboration from key open source
and proprietary engines implementing Parquet readers/writers. One way to
facilitate the transition IMO would be to make sure native Parquet-Arrow
implementations are included, which is a bit lacking in the Java
implementation.

Best
Julien
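For readers following along: the "PAR1" magic mentioned above brackets the file, and a 4-byte little-endian footer length sits just before the trailing magic. Below is a stdlib-only toy sketch of that tail layout operating on an in-memory byte string; the footer bytes are placeholders, whereas a real footer is Thrift-serialized FileMetaData.

```python
import struct

# Toy illustration of the Parquet tail layout: the file ends with the
# serialized footer, a 4-byte little-endian footer length, and the
# 4-byte magic "PAR1" (the same magic also opens the file).
MAGIC = b"PAR1"

def read_footer_bytes(data: bytes) -> bytes:
    """Extract the raw footer from an in-memory Parquet file image."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    footer_len = struct.unpack("<I", data[-8:-4])[0]
    return data[-8 - footer_len:-8]

# Build a fake file: magic + payload + footer + footer length + magic.
footer = b"\x15\x00"  # placeholder bytes standing in for Thrift
fake = MAGIC + b"data" + footer + struct.pack("<I", len(footer)) + MAGIC
assert read_footer_bytes(fake) == footer
```

A changed leading magic is one conceivable signal for an incompatible metadata layer, since old readers checking for "PAR1" would reject the file cleanly.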

On Mon, May 13, 2024 at 10:45 AM Micah Kornfield 
wrote:

> Thanks everybody for the input.  I'll try to summarize some main points and
> my thoughts below.
>
> 1.  "V3" branding is problematic and getting adoption is difficult
> with V2.  I agree, we should not lump all potential improvements into a
> single V3 milestone (I used V3 to indicate that at least some changes might
> be backward incompatible with existing format revisions).   In my mind, I
> think the way to make it more likely that new features are used would be
> starting to think about a more formal release process for them.  For
> example:
> a.  A clear cadence of major version library releases (e.g. maybe once
> per year).
> b.  A clear policy for when a new feature becomes the default in a
> library release (e.g. as a strawman once the feature lands in reference
> implementation, it is eligible to become default in the next major release
> that occurs >1 year later).
> c.  For reference implementations that are effectively doing major
> version releases on each release, I think following parquet-mr for flipping
> defaults would make sense.
>
> 2.  How much of the improvements can be a clean slate vs
> evolutionary/implementation optimizations?  I really think this depends on
> which aspects we are tackling. For metadata issues, I think it might pay to
> rethink things from the ground up, but any proposals along these lines
> should obviously have clear rationales and benchmarks to clarify how the
> decisions are made.  For better encodings, most likely work can be added to
> the existing format.  I don't think allowing for arbitrary plugin encodings
> would be a good thing.  I believe one of the reasons that Parquet has been
> successful has been its specification which allows for guaranteed
> compatibility.
>
> 3.  Amount of effort required/Sustainability of effort.  I agree this is a
> big risk. It will take a lot of work to cover the major parquet bindings,
> which is why I started the thread. Personally, I am fairly time constrained
> and unless my employer is willing to approve devoting work hours to the
> project I likely won't be able to contribute much.  However, it seems like
> there might be enough interest from the community that I can potentially
> make the case for doing so.
>
> Thanks,
> Micah
>
> On Mon, May 13, 2024 at 10:41 AM Ed Seidl  wrote:
>
> > I think the whole "V1" vs "V2" mess is unfortunate. IMO there is only one
> > version of the Parquet file format. At its core, the data layout (row
> > groups
> > composed of column chunks composed of Dremel encoded pages) has
> > never changed. Encodings/codecs/structures have been added to that core,
> > but always in a backwards compatible way.
> >
> > I agree that many of the perceived shortcomings might be 

Re: Repeated fields spec clarification

2024-05-10 Thread Julien Le Dem
Jan, your understanding of the Parquet spec is correct.
The semantics of "num_rows" and "first_row_index" do require records to
*not* be split across pages.
Push downs and page skipping require this to be true.
I would consider the behavior of splitting a record across pages as a bug
in pyarrow.parquet.
I'd support updating the spec to have stronger language if you think it is
necessary.

On Fri, May 10, 2024 at 11:36 AM Andrew Lamb  wrote:

> We encountered a similar question / issue in the Rust parquet
> implementation[1].
>
> Raphael's conclusion was that pages need to start with r-level 0 if using
> V2 data pages or if there is a page index. Among other reasons, if this
> doesn't hold, it is not possible to do pushdown on nested columns as you
> have no idea where the last record actually ends.
>
> We updated the parquet-rs reader to make this assumption in [2]
>
> If others on this thread agree I would be happy to draft a spec
> clarification on this point
>
> Andrew
>
>
>
>
>
> [1] https://github.com/apache/arrow-rs/issues/3680
> [2] https://github.com/apache/arrow-rs/pull/4943
>
>
>
> On Fri, May 10, 2024 at 1:15 PM Jan Finis  wrote:
>
> > Hey Parquet devs,
> >
> > I so far thought that Parquet mandates that records start at page
> > boundaries, i.e., at r-level 0, and we have relied on this fact in some
> > places of our engine. That means, there cannot be any data page for a
> > REPEATED column that starts at an r-level > 0, as this would mean that a
> > record would be split between multiple pages.
> >
> > I also found the two comments in parquet.thrift:
> >
> >   /** Number of rows in this data page. which means pages change on
> record
> > > boundaries (r = 0) **/
> > >   3: required i32 num_rows
> >
> >
> >   /**
> > >* Index within the RowGroup of the first row of the page; this means
> > > pages
> > >* change on record boundaries (r = 0).
> > >*/
> > >   3: required i64 first_row_index
> >
> >
> > These comments seem to imply that my understanding is correct. However,
> > they are worded very weakly, not like a mandate but more like a "by the
> > way" comment.
> >
> > I haven't found any other mention of r-levels and page boundaries in the
> > parquet-format repo (maybe I missed them?).
> >
> > I recently noticed that pyarrow.parquet splits repeated fields over
> > multiple pages, so it violates this. This triggers assertions in our
> > engine, so I want to understand what's the right course of action here.
> >
> > So, can we please clarify:
> > *Does Parquet mandate that pages need to start at r-level 0?*
> >
> >- I.e., is a parquet file with a page that starts at an r-level > 0
> >ill-formed? I.e., is this a bug in pyarrow.parquet?
> >- Or can pages start at an r-level > 0? If so, then what is the
> >significance of the comments in parquet.thrift?
> >
> >
> > Cheers,
> > Jan
> >
>
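For readers new to this thread, the invariant under discussion is easiest to see with a toy model of repetition levels for a single-level repeated field. This is a simplification of the Dremel encoding (real writers also emit definition levels), intended only to show why a page starting at an r-level > 0 splits a record.

```python
def repetition_levels(records):
    """Repetition levels for a single-level repeated leaf column.

    The first value of each record gets r-level 0 (a new record
    starts); every following value in the same record gets r-level 1.
    """
    levels = []
    for record in records:
        for i, _ in enumerate(record):
            levels.append(0 if i == 0 else 1)
    return levels

# Three records [1, 2, 3], [4], [5, 6] yield six leaf values.
levels = repetition_levels([[1, 2, 3], [4], [5, 6]])
assert levels == [0, 1, 1, 0, 0, 1]

# A page boundary is record-aligned only if the first value of the next
# page has r-level 0; cutting after the first value (levels[1] == 1)
# would split record [1, 2, 3] across pages, and a reader skipping to
# the second page could not tell where that record ends.
```

This is why `num_rows` and `first_row_index` only have well-defined meanings when pages begin at r-level 0.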


Re: [Request] Send automated notifications to a separate mailing-list

2023-08-21 Thread Julien Le Dem
+1

On Mon, Aug 21, 2023 at 10:16 AM Antoine Pitrou  wrote:

>
> Hello,
>
> I would like to request that automated notifications (from GitHub,
> Jira... whatever) be sent to a separate mailing-list and GMane mirror.
> Currently, the endless stream of automated notifications in this
> mailing-list means that discussions between humans quickly get lost or
> even unnoticed by other people.
>
> For the record, we did this move in Apache Arrow and never came back.
>
> Thanks in advance
>
> Antoine.
>
>
>


Re: [VOTE] Release Apache Parquet 1.12.1 RC1

2021-09-14 Thread Julien Le Dem
+1 (binding)
I verified the signature
the build and tests pass (with java 8)

On Tue, Sep 14, 2021 at 4:14 PM Xinli shang  wrote:

> I also vote +1 (binding). Thanks everybody for verifying!
>
> On Tue, Sep 14, 2021 at 2:00 PM Chao Sun  wrote:
>
> > +1 (non-binding).
> >
> > - tested on the Spark side and all tests passed, including the issue in
> > SPARK-36696
> > - verified signature and checksum of the release
> >
> > Thanks Xinli for driving the release work!
> >
> > Chao
> >
> > On Tue, Sep 14, 2021 at 3:01 AM Gabor Szadovszky 
> wrote:
> >
> > > Thanks for the new RC, Xinli.
> > >
> > > The content seems correct to me. The checksum and sign are correct.
> Unit
> > > tests pass.
> > >
> > > My vote is +1 (binding)
> > >
> > > On Mon, Sep 13, 2021 at 8:11 PM Xinli shang 
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > >
> > > > I propose the following RC to be released as the official Apache
> > Parquet
> > > > 1.12.1 release.
> > > >
> > > >
> > > > The commit id is 2a5c06c58fa987f85aa22170be14d927d5ff6e7d
> > > >
> > > > * This corresponds to the tag: apache-parquet-1.12.1-rc1
> > > >
> > > > *
> > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/tree/2a5c06c58fa987f85aa22170be14d927d5ff6e7d
> > > >
> > > >
> > > > The release tarball, signature, and checksums are here:
> > > >
> > > > *
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.1-rc1/
> > > >
> > > >
> > > > You can find the KEYS file here:
> > > >
> > > > * https://dist.apache.org/repos/dist/release/parquet/KEYS
> > > >
> > > >
> > > > Binary artifacts are staged in Nexus here:
> > > >
> > > > *
> > >
> https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > >
> > > >
> > > > This release includes important changes listed
> > > > https://github.com/apache/parquet-mr/blob/parquet-1.12.x/CHANGES.md
> > > >
> > > >
> > > > Please download, verify, and test.
> > > >
> > > >
> > > > Please vote in the next 72 hours.
> > > >
> > > >
> > > > [ ] +1 Release this as Apache Parquet 1.12.1
> > > >
> > > > [ ] +0
> > > >
> > > > [ ] -1 Do not release this because...
> > > >
> > > > --
> > > > Xinli Shang | Tech Lead Manager @ Uber Data Infra
> > > >
> > >
> >
>
>
> --
> Xinli Shang
>


New Parquet PMC chair

2021-05-28 Thread Julien Le Dem
Hello Parquet community,
The Parquet PMC discussed and decided some time ago to move to a rotating
chair.
Every year around this time the PMC will elect a new chair to represent the
project to the board.
I'm happy to announce that Xinli Shang is the first person elected VP of
Apache Parquet since the inception of the project.
Xinli has been driving several community efforts and is instrumental to the
project.
Please join me in congratulating him.
congrats Xinli!
Julien
- former Parquet PMC chair


Re: [VOTE] Release Apache Parquet 1.12.0 RC4

2021-03-24 Thread Julien Le Dem
+1 (binding)
I verified the signature and built from source.
All tests pass.
Looks good.

On Wed, Mar 24, 2021 at 2:07 AM Gabor Szadovszky  wrote:

> I currently have the feeling that the Avro/Jackson related issue has been
> discussed and the community agrees on moving forward with this RC as is
> (without upgrading the Avro and the Jackson dependencies).
> So, I'm giving my +1 (binding) vote.
>
> On Tue, Mar 23, 2021 at 9:28 PM Aaron Niskode-Dossett
>  wrote:
>
> > +1 (non-binding)
> >
> > - cloned the 1.12.0-rc-4 tag from github
> > - compiled jars locally and all tests passed
> > - used the 1.12.0 jars as dependencies for a local application that
> streams
> > data into protobuf-parquet files
> > - confirmed data is correct and can be read with parquet-tools compiled
> > from parquet 1.11.1
> >
> > On Tue, Mar 23, 2021 at 10:47 AM Xinli shang 
> > wrote:
> >
> > > Let's discuss it in today's community sync meeting.
> > >
> > > On Tue, Mar 23, 2021 at 8:37 AM Aaron Niskode-Dossett
> > >  wrote:
> > >
> > > > Gabor and Ismaël, thank you both for the very clear explanations of
> > > what's
> > > > going on.
> > > >
> > > > Based on Gabor's description of avro compatibility I would be +1
> > > > (non-binding) for the current RC.
> > > >
> > > > On Tue, Mar 23, 2021 at 4:36 AM Gabor Szadovszky 
> > > wrote:
> > > >
> > > > > Thanks, Ismaël for the explanation. I have a couple of notes about
> > your
> > > > > concerns.
> > > > >
> > > > > - Parquet 1.12.0 as per the semantic versioning is not a major but
> a
> > > > minor
> > > > > release. (It is different from the Avro versioning strategy where
> the
> > > > > second version number means major version changes.)
> > > > > - The jackson dependency is shaded in the parquet jars so the
> > > > > synchronization of the version is not needed (and not even
> possible).
> > > > > - Using the latest Avro version makes sense but if we do not use it
> > for
> > > > the
> > > > > current release it should not cause any issues in our clients.
> Let's
> > > > check
> > > > > the following example. We upgrade to the latest 1.10.2 Avro release
> > in
> > > > > parquet then release it under 1.12.0. Later on Avro creates a new
> > > release
> > > > > (e.g. 1.10.3 or even 1.11.0) while Parquet does not. In this case
> our
> > > > > clients need to upgrade Avro without Parquet. If it is a major Avro
> > > > release
> > > > > it might occur that the Parquet code has to be updated but usually
> it
> > > is
> > > > > not the case. (The last time we've had to change production code
> for
> > an
> > > > > Avro upgrade was from 1.7.6 to 1.8.0.) I think our clients should
> be
> > > able
> > > > > to upgrade Avro independently from Parquet and vice versa (until
> > there
> > > > are
> > > > > incompatibility issues). I would even change Parquet's Avro
> > dependency
> > > to
> > > > > "provided" but that might be a breaking change and I clearly won't do
> > it
> > > > just
> > > > > before the release.
> > > > >
> > > > > What do you think? Anyone have a strong opinion about this topic?
> > > > >
> > > > > Cheers,
> > > > > Gabor
> > > > >
> > > > > On Mon, Mar 22, 2021 at 6:31 PM Ismaël Mejía 
> > > wrote:
> > > > >
> > > > > > Sure. The Avro upgrade feature/API wise is minor for Parquet, so
> > the
> > > > > > possibility of adding a regression is really REALLY minor. The
> > hidden
> > > > > issue
> > > > > > is the new transitive dependencies introduced by Avro, concretely
> > > > Jackson
> > > > > > 2.12.2.
> > > > > >
> > > > > > Since Parquet 1.12.0 is a major version it is probably a good
> > moment
> > > to
> > > > > > upgrade Jackson too that's why I opened [1] (already merged). In
> > > > > particular
> > > > > > now that Spark merged support for both Avro 1.10.2 [1] and
> Jackson
> > > > 2.12.2
> > > > > > [2] for the upcoming 3.2.0 release, so now Spark can easily bring
> > > > > upgraded
> > > > > > Parquet too with all the dependencies well aligned. This of
> course
> > is
> > > > > not a
> > > > > > blocker for the release or for other downstream projects but it
> > might
> > > > > help
> > > > > > to make their life better because they will have less dependency
> > > > > alignment
> > > > > > issues to battle.
> > > > > >
> > > > > > Ismaël
> > > > > >
> > > > > > [1] https://github.com/apache/parquet-mr/pull/883
> > > > > > [2] https://github.com/apache/spark/pull/31866
> > > > > > [3] https://github.com/apache/spark/pull/31878
> > > > > >
> > > > > >
> > > > > > On Mon, Mar 22, 2021, 3:37 PM Xinli shang
>  > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi Ismaël,
> > > > > > >
> > > > > > > Can you explain a little bit more on if we don't upgrade in
> this
> > > > > release,
> > > > > > > what could be the worst-case scenario for the ecosystem? The
> > > > > last-minute
> > > > > > > upgrading seems rushed to me but I would like to hear what the
> > > > > > > impact would be if we don't.  As Gabor mentioned, this should not be a
> > > show-stopper.
> > > 

[jira] [Commented] (PARQUET-1666) Remove Unused Modules

2020-12-02 Thread Julien Le Dem (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242864#comment-17242864
 ] 

Julien Le Dem commented on PARQUET-1666:


that sounds good to me too

> Remove Unused Modules 
> --
>
> Key: PARQUET-1666
> URL: https://issues.apache.org/jira/browse/PARQUET-1666
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> In the last two meetings, Ryan Blue proposed to remove some unused Parquet 
> modules. This is to open a task to track it. 
> Here are the related meeting notes for the discussion on this. 
> Remove old Parquet modules
> Hive modules - sounds good
> Scrooge - Julien will reach out to Twitter
> Tools - undecided - Cloudera may still use the parquet-tools according to 
> Gabor.
> Cascading - undecided
> We can change the module as deprecated as description.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[ANNOUNCE] New Parquet PMC member - Xinli Shang

2020-11-09 Thread Julien Le Dem
On behalf of the Apache Parquet PMC, I'm happy to announce that Xinli Shang
has accepted to join the PMC.

Congrats Xinli!


Re: Metadata summary file deprecation

2020-09-29 Thread Julien Le Dem
Hi Jason,
Thank you for bringing this up.
A correctness issue would only come up when more parquet files are added to
the same folder or files are modified. Historically folders have been
considered immutable and the summary file reflects the metadata for all
the files in the folder. The summary file contains the names of the files
it is for, so extra files in the folder can also be detected and dealt with
at read time without correctness issues.
Like you mentioned the read path allows for those files to not be present.
I think a better solution than deprecating would be to have a switch
for turning off those summary files when one does not want to respect
the immutable-folder contract.
Projects like Iceberg can elect to not produce them and allow modifying and
adding more files to the same folder without creating correctness problems.
I would be in favor of removing those Deprecated annotations and document
the use of a switch to optionally not produce the summary files when
electing to modify folders.
I'm curious to hear from Ryan about this who did the change in the first
place.
Best,
Julien

On Fri, Sep 25, 2020 at 3:06 PM Jason Altekruse 
wrote:

> Hi Jacques,
>
> It's good to hear from you, thanks for the pointer to Iceberg. I am aware
> of it as well as other similar projects, including Delta Lake, which my
> team is already using. Unfortunately even with Delta, there is only a
> placeholder in the project currently where they will be tracking file level
> statistics at some point in the future, we are also evaluating the
> possibility of implementing this in delta itself. While it and Iceberg
> aren't quite the same architecturally, I think there is enough overlap that
> it might be a bit awkward to use the two in conjunction with one another.
>
> From my testing so far, it appears that delta pretty easily can operate
> alongside these older metadata summary files without the two fighting with
> each other. Delta is responsible for maintaining a transactionally
> consistent list of files, and this file can coexist in the directory just
> to allow efficient pruning on the driver side on a best effort basis, as it
> can gracefully fall back to the FS if it is missing a newer file.
>
> We are somewhat nervous about depending on something that is marked
> deprecated, but as it is so close to a "just works" state for our needs, I
> was hoping to confirm with the community if there were other risks I was
> missing.
>
> Jason Altekruse
>
> On Wed, Sep 23, 2020 at 6:29 PM Jacques Nadeau  wrote:
>
> > Hey Jason,
> >
> > I'd suggest you look at Apache Iceberg. It is a much more mature way of
> > handling metadata efficiency issues and provides a substantial superset
> of
> > functionality over the old metadata cache files.
> >
> > On Wed, Sep 23, 2020 at 4:16 PM Jason Altekruse <
> altekruseja...@gmail.com>
> > wrote:
> >
> > > Hello again,
> > >
> > > I took a look through the mail archives and found a little more
> > information
> > > in this and a few other threads.
> > >
> > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox//parquet-dev/201707.mbox/%3CCAO4re1k8-bZZZWBRuLCnm1V7AoJE1fdunSuBn%2BecRuFGPgcXnA%40mail.gmail.com%3E
> > >
> > > While I do understand the benefits for federating out the reading of
> > > footers for the sake of not worrying about synchronization between the
> > > cached metadata and any changes to the files on disk, it does appear
> > there
> > > is still a use case that isn't solved well with this design, needle in
> a
> > > haystack selective filter queries, where the data is sorted by the
> filter
> > > column. For example in the tests I ran with queries against lots of
> > parquet
> > > files where the vast majority are pruned by a bunch of small tasks, it
> > > takes 33 seconds vs just 1-2 seconds with driver side pruning using the
> > > summary file (requires a small Spark changeset).
> > >
> > > In our use case we are never going to be replacing contents of existing
> > > parquet files (with a delete and rewrite on HDFS) or appending new row
> > > groups onto existing files. In that case I don't believe we should
> > > experience any correctness problems, but I wanted to confirm if there
> is
> > > something I am missing. I am
> > > using readAllFootersInParallelUsingSummaryFiles which does fall back to
> > > read individual footers if they are missing from the summary file.
> > >
> > > I am also curious if a solution to the correctness problems could be to
> > > include a file length and/or last modified time into the summary file,
> > > which could confirm against FS metadata that the files on disk are
> still
> > in
> > > sync with the metadata summary relatively quickly. Would it be possible
> > to
> > > consider avoiding this deprecation if I was to work on an update to
> > > implement this?
> > >
> > > - Jason Altekruse
> > >
> > >
> > > On Tue, Sep 15, 2020 at 8:52 PM Jason Altekruse <
> > altekruseja...@gmail.com>
> > > wrote:
> > >
> > > > Hello all,

[Announce] new committer: Xinli Shang

2020-03-12 Thread Julien Le Dem
The Project Management Committee (PMC) for Apache Parquet
has invited Xinli Shang to become a committer and we are pleased
to announce that he has accepted.

Welcome Xinli!


[jira] [Assigned] (PARQUET-1777) add Parquet logo vector files to repo

2020-01-24 Thread Julien Le Dem (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-1777:
--

Assignee: Julien Le Dem

> add Parquet logo vector files to repo
> -
>
> Key: PARQUET-1777
> URL: https://issues.apache.org/jira/browse/PARQUET-1777
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>        Reporter: Julien Le Dem
>        Assignee: Julien Le Dem
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1777) add Parquet logo vector files to repo

2020-01-24 Thread Julien Le Dem (Jira)
Julien Le Dem created PARQUET-1777:
--

 Summary: add Parquet logo vector files to repo
 Key: PARQUET-1777
 URL: https://issues.apache.org/jira/browse/PARQUET-1777
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Julien Le Dem






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release Apache Parquet 1.11.0 RC7

2019-12-05 Thread Julien Le Dem
I verified the signatures
ran the build and test
It looks like the compatibility changes being discussed are not blockers.

+1 (binding)


On Wed, Nov 27, 2019 at 1:43 AM Gabor Szadovszky  wrote:

> Thanks, Zoltan.
>
> I also vote +1 (binding)
>
> Cheers,
> Gabor
>
> On Tue, Nov 26, 2019 at 1:46 PM Zoltan Ivanfi 
> wrote:
>
> > +1 (binding)
> >
> > - I have read through the problem reports in this e-mail thread (one
> caused
> > by the use of a private method via reflection and another one caused by
> > having mixed versions of the libraries on the classpath) and I am
> convinced
> > that they do not block the release.
> > - Signature and hash of the source tarball are valid.
> > - The specified git hash matches the specified git tag.
> > - The contents of the source tarball match the contents of the git repo
> at
> > the specified tag.
> >
> > Br,
> >
> > Zoltan
> >
> >
> > On Tue, Nov 26, 2019 at 10:54 AM Gabor Szadovszky 
> > wrote:
> >
> > > Created https://issues.apache.org/jira/browse/PARQUET-1703 to track
> > this.
> > >
> > > Back to the RC. Anyone from the PMC willing to vote?
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Mon, Nov 25, 2019 at 6:45 PM Ryan Blue 
> > > wrote:
> > >
> > > > Gabor, good point about not being able to check new APIs. Updating
> the
> > > > previous version would also allow us to get rid of temporary
> > exclusions,
> > > > like the one you pointed out for schema. It would be great to improve
> > > what
> > > > we catch there.
> > > >
> > > > On Mon, Nov 25, 2019 at 1:56 AM Gabor Szadovszky 
> > > wrote:
> > > >
> > > > > Hi Ryan,
> > > > >
> > > > > It is a different topic but would like to reflect shortly.
> > > > > I understand that 1.7.0 was the first apache release. The problem
> > with
> > > > > doing the compatibility checks comparing to 1.7.0 is that we can
> > easily
> > > > add
> > > > > incompatibilities in API which are added after 1.7.0. For example:
> > > > Adding a
> > > > > new class for public use in 1.8.0 then removing it in 1.9.0. The
> > > > > compatibility check would not discover this breaking change. So, I
> > > > think, a
> > > > > better approach would be to always compare to the previous minor
> > > release
> > > > > (e.g. comparing 1.9.0 to 1.8.0 etc.).
> > > > > As I've written before, even org/apache/parquet/schema/** is
> excluded
> > > > from
> > > > > the compatibility check. As far as I know this is public API. So, I
> > am
> > > > not
> > > > > sure that only packages that are not part of the public API are
> > > excluded.
> > > > >
> > > > > Let's discuss this on the next parquet sync.
> > > > >
> > > > > Regards,
> > > > > Gabor
> > > > >
> > > > > On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue
>  > >
> > > > > wrote:
> > > > >
> > > > > > Gabor,
> > > > > >
> > > > > > 1.7.0 was the first version using the org.apache.parquet
> packages,
> > so
> > > > > > that's the correct base version for compatibility checks. The
> > > > exclusions
> > > > > in
> > > > > > the POM are classes that the Parquet community does not consider
> > > > public.
> > > > > We
> > > > > > rely on these checks to highlight binary incompatibilities, and
> > then
> > > we
> > > > > > discuss them on this list or in the dev sync. If the class is
> > > internal,
> > > > > we
> > > > > > add an exclusion for it.
> > > > > >
> > > > > > I know you're familiar with this process since we've talked about
> > it
> > > > > > before. I also know that you'd rather have more strict binary
> > > > > > compatibility, but until we have someone with the time to do some
> > > > > > maintenance and build a public API module, I'm afraid that's what
> > we
> > > > have
> > > > > > to work with.
> > > > > >
> > > > > > Michael,
> > > > > >
> > > > > > I hope the context above is helpful and explains why running a
> > binary
> > > > > > compatibility check tool will find incompatible changes. We allow
> > > > binary
> > > > > > incompatible changes to internal classes and modules, like
> > > > > parquet-common.
> > > > > >
> > > > > > On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <
> > ga...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Ryan,
> > > > > > > I would not trust our compatibility checks (semver) too much.
> > > > > Currently,
> > > > > > it
> > > > > > > is configured to compare our current version to 1.7.0. It means
> > > > > anything
> > > > > > > that is added since 1.7.0 and then broke in a later release
> won't
> > > be
> > > > > > > caught. In addition, many packages are excluded from the check
> > > > because
> > > > > of
> > > > > > > different reasons. For example org/apache/parquet/schema/** is
> > > > excluded
> > > > > > so
> > > > > > > if it would really be an API compatibility issue we certainly
> > > > wouldn't
> > > > > > > catch it.
> > > > > > >
> > > > > > > Michael,
> > > > > > > It fails because of a NoSuchMethodError pointing to a method
> that
> > > is
> > > > > > newly
> > > > > > > introduced in 1.11. Both the caller and the callee 
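To make Gabor's point concrete: a compatibility check is essentially a set difference over API surfaces — anything public in the baseline release that no longer resolves in the new one is a breaking change, and callers compiled against the old jar fail at runtime with NoSuchMethodError. A toy sketch (Python, with purely hypothetical class and method names; real checkers operate on compiled bytecode, not name sets):

```python
# Toy illustration of what a binary-compatibility check does: any public
# method present in the baseline but missing from the new release is a
# breaking change. Class/method names below are hypothetical examples.

def breaking_changes(old_api: dict, new_api: dict) -> list:
    """Return methods removed (or signature-changed) relative to the baseline."""
    removed = []
    for cls, methods in old_api.items():
        new_methods = new_api.get(cls, set())
        for sig in methods:
            if sig not in new_methods:
                removed.append(f"{cls}#{sig}")
    return sorted(removed)

# A method added in "1.8.0" and dropped again in "1.9.0" is only caught
# when 1.9.0 is compared against 1.8.0, not against the 1.7.0 baseline.
v1_8 = {"org.apache.parquet.schema.Types": {"buildMessage()", "optional(type)"}}
v1_9 = {"org.apache.parquet.schema.Types": {"buildMessage()"}}

print(breaking_changes(v1_8, v1_9))
# ['org.apache.parquet.schema.Types#optional(type)']
```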

Re: Parquet sync zoom - invalid meeting ID

2019-11-21 Thread Julien Le Dem
that worked, thanks!

On Thu, Nov 21, 2019 at 9:11 AM Xinli shang  wrote:

> Can you try https://uber.zoom.us/j/142456544?
>
> On Thu, Nov 21, 2019 at 9:07 AM Gabor Szadovszky  wrote:
>
> > Hi,
> >
> > Is it just me who cannot join to the meeting? It says "Invalid meeting
> > ID"...
> >
> > Cheers,
> > Gabor
> >
>
>
> --
> Xinli Shang
>


Re: Parquet sync zoom - invalid meeting ID

2019-11-21 Thread Julien Le Dem
same for me. can someone send a new link?

On Thu, Nov 21, 2019 at 9:08 AM Jim Apple  wrote:

> The same is happening to me. Additionally, one of the toll-free phone
> numbers did not pick up.
>
> No outages I see: https://statusgator.com/services/zoom,
> https://status.zoom.us/
>
> On 2019/11/21 17:06:56, Gabor Szadovszky  wrote:
> > Hi,
> >
> > Is it just me who cannot join to the meeting? It says "Invalid meeting
> > ID"...
> >
> > Cheers,
> > Gabor
> >
>


Re: Parquet Sync - 10/17/2019 - Meeting Notes

2019-10-17 Thread Julien Le Dem
Thanks for the notes. Sorry I missed the sync because of a conflict.

On Thu, Oct 17, 2019 at 10:00 AM Gidon Gershinsky  wrote:

> A slight correction re C++. I said the following
> C++ work is near completion/merge. Deepak has reviewed it and made
> additional changes / refactoring.
>
> On Thu, Oct 17, 2019 at 7:33 PM  wrote:
>
>> 10/17/2019
>>
>> Attendees:
>> Gidon
>> Gabor
>> Ryan
>> Karfiol
>> Xinli
>>
>> Topics:
>>
>> Column Encryption
>> For the C++ version, Gidon worked with Deepak to keep reviews going.
>> For Java, we are blocked on the Parquet 1.11 release. Gabor proposed to
>> branch the 1.11 work and merge it later, but it would need to be in
>> master as the final step.
>>
>> Bloom Filter
>> Next step is to wait for the Parquet 1.11 release
>>
>> Parquet 1.11 Validation
>> Ryan - the release can go ahead without me if there are enough PMCs
>> Gabor - I will try to push this effort in a couple of weeks
>>
>> Ongoing Parquet Work
>> There is some work underway to create PRs shortly to optimize Parquet's use
>> of byte buffers, parallelization, S3 reading, etc.
>>
>> Xinli Shang (Uber)
>>
>> Parquet Sync - Monthly(every 3rd thursday)
>> Hi all,
>>
>> This is an invitation for the next occasion of the regular sync meeting
>> of the Parquet community.
>>
>> Xinli Shang
>>
>> Join Zoom Meeting
>> https://uber.zoom.us/j/112318682
>>
>> One tap mobile
>> +16699006833,,112318682# US (San Jose)
>> +16468769923,,112318682# US (New York)
>>
>> Dial by your location
>> +1 669 900 6833 US (San Jose)
>> +1 646 876 9923 US (New York)
>> 855 880 1246 US Toll-free
>> 877 369 0926 US Toll-free
>> Meeting ID: 112 318 682
>> Find your local number: https://zoom.us/u/aZKZunOZ9
>>
>> Join by SIP
>> 112318...@zoomcrc.com
>>
>> Join by H.323
>> 162.255.37.11 (US West)
>> 162.255.36.11 (US East)
>> 221.122.88.195 (China)
>> 115.114.131.7 (India)
>> 213.19.144.110 (EMEA)
>> 103.122.166.55 (Australia)
>> 209.9.211.110 (Hong Kong)
>> 64.211.144.160 (Brazil)
>> 69.174.57.160 (Canada)
>> 207.226.132.110 (Japan)
>> Meeting ID: 112 318 682
>> *When*
>> Thu Oct 17, 2019 9am – 10am Pacific Time - Los Angeles
>>
>> *Where*
>> https://uber.zoom.us/j/112318682, SEA | 1191 2nd Ave-8th-Whidbey (7)
>> [Zoom] (map
>> <https://www.google.com/maps/search/https:%2F%2Fuber.zoom.us%2Fj%2F112318682,+SEA+%7C+1191+2nd+Ave-8th-Whidbey+%287%29+%5BZoom%5D?hl=en>
>> )
>>
>> *Who*
>> •
>> sha...@uber.com - organizer
>> •
>> shri.hariharasubrahman...@oracle.com
>> •
>> non...@gmail.com
>> •
>> robe...@palantir.com
>> •
>> szonyi.a...@gmail.com
>> •
>> szo...@cloudera.com
>> •
>> m.lac...@criteo.com
>> •
>> csringho...@cloudera.com
>> •
>> rzam...@nvidia.com
>> •
>> borokna...@cloudera.com
>> •
>> bikramjeet@cloudera.com
>> •
>> dev@parquet.apache.org
>> •
>> daniels...@gmail.com
>> •
>> smanik...@gmail.com
>> •
>> nkol...@cloudera.com
>> •
>> ven...@uber.com
>> •
>> q...@criteo.com
>> •
>> jimmyjc...@tencent.com
>> •
>> vercego...@cloudera.com
>> •
>> Xu, Cheng A
>> •
>> aniket...@gmail.com
>> •
>> jbap...@cloudera.com
>> •
>> Julien Le Dem
>> •
>> apha...@cloudera.com
>> •
>> yalia...@twitter.com
>> •
>> marc...@gmail.com
>> •
>> mark.ma...@kognitio.com
>> •
>> santlal.gu...@bitwiseglobal.com
>> •
>> Daniel Weeks
>> •
>> wesmck...@gmail.com
>> •
>> sunc...@apache.org
>> •
>> j.cof...@criteo.com
>> •
>> Reynold Xin
>> •
>> Ryan Blue
>> •
>> Lars Volker
>> •
>> alexleven...@twitter.com
>> •
>> jacq...@apache.org
>> •
>> Sergio Pena
>> •
>> gg5...@gmail.com
>> •
>> fnoth...@berkeley.edu
>> •
>> lukas.naleze...@gmail.com
>> •
>> m.li...@criteo.com
>> •
>> stak...@cloudera.com
>> •
>> s...@yelp.com
>> •
>> o.kaidan...@criteo.com
>> •
>> altekruseja...@gmail.com
>> •
>> brian.bow...@sas.com
>> •
>> julien.le...@gmail.com
>> •
>> Mohammad Islam
>> •
>> gabor.szadovs...@cloudera.com
>> •
>> andy.gr...@rms.com
>> •
>> Wei Han
>> •
>> yumw...@ebay.com
>> •
>> bal...@uber.com
>> •
>> ippokra...@gmail.com
>> •
>> Pavi Subenderan
>> •
>> Zoltan Ivanfi
>> •
>> b.hano...@criteo.com
>> •
>> dam6...@gmail.com
>> •
>> majeti.dee...@gmail.com
>> •
>> Parth Chandra
>> •
>> Mohit Sabharwal
>> •
>> nilangekar.po...@gmail.com
>> •
>> xho...@gmail.com
>> •
>> gorec...@amazon.com
>>
>


Re: [VOTE] Release Apache Parquet Format 2.7.0 RC0

2019-09-26 Thread Julien Le Dem
verified signature, build, ran tests
+1

For information:
You can verify the signature by following:
https://httpd.apache.org/dev/verification.html
(import the KEYS file listed in the email)
To build on a mac:

 brew install maven

 brew install thrift

 mvn test

 mvn package



On Thu, Sep 26, 2019 at 1:59 AM Driesprong, Fokko 
wrote:

> Checked signature and checksums
>
> +1 (non-binding)
>
> Cheers, Fokko
>
> Op do 26 sep. 2019 om 10:16 schreef Gabor Szadovszky
> :
>
> > Checksums/signatures are correct. Tarball content is correct. Unit tests
> > pass.
> >
> > +1 (binding)
> >
> > On Thu, Sep 26, 2019 at 6:02 AM 俊杰陈  wrote:
> >
> > > +1,  downloaded, verified the signature key ID is A4B2E9B5 which is
> > > from Ryan, ran mvn install successfully.
> > >
> > > On Thu, Sep 26, 2019 at 11:20 AM Jim Apple  wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > > I propose the following RC to be released as the official Apache
> > Parquet
> > > Format 2.7.0 release.
> > > >
> > > > The commit id is ee5cae066ed602bd969024eb308c5262c451b6cd
> > > > * This corresponds to the tag: apache-parquet-format-2.7.0
> > > > *
> > >
> >
> https://github.com/apache/parquet-format/tree/ee5cae066ed602bd969024eb308c5262c451b6cd
> > > >
> > > > The release tarball, signature, and checksum are here:
> > > > *
> > >
> >
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.7.0-rc0/
> > > >
> > > > Ryan Blue prepared the artifacts. You can find his key in the KEYS
> file
> > > here:
> > > > * https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > > >
> > > > Binary artifacts are staged in Nexus here:
> > > > *
> > >
> >
> https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/2.7.0/
> > > >
> > > > This release includes important changes visible here:
> > >
> >
> https://github.com/apache/parquet-format/blob/ee5cae066ed602bd969024eb308c5262c451b6cd/CHANGES.md
> > > >
> > > > Please download, verify, and test.
> > > >
> > > > Please vote by Sat Sep 28 21:00:00 PDT 2019, aka
> > > 2019-09-29T04:00:00+00:00
> > > >
> > > > [ ] +1 Release this as Apache Parquet Format 2.7.0
> > > > [ ] +0
> > > > [ ] -1 Do not release this because...
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
> >
>


Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

2019-08-29 Thread Julien Le Dem
This looks promising to me. At first glance it seems to combine
simplicity and efficiency.
I'd like to hear more from other members of the PMC.

On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin  wrote:

> Dear all,
>
>
> there was some earlier discussion on adding a new encoding for better
> compression of FP32 and FP64 data.
>
>
> The pull request which extends the format is here:
> https://github.com/apache/parquet-format/pull/144
> The change has one approval from earlier from Zoltan.
>
>
> The results from an investigation on compression ratio and speed with the
> new encoding vs other encodings is available here:
> https://github.com/martinradev/arrow-fp-compression-bench
> For many tests the new encoding performs better in
> compression ratio and, in some cases, in speed. The improvements in
> compression speed come from the fact that the new format can potentially
> lead to faster parsing for some compressors like GZIP.
>
>
> An earlier report which examines other FP compressors (fpzip, spdp, fpc,
> zfp, sz) and new potential encodings is available here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> The report also covers lossy compression but the BYTE_STREAM_SPLIT
> encoding only has the focus of lossless compression.
>
>
> Can we have a vote?
>
>
> Regards,
>
> Martin
>
>
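For reference, the core of the proposed encoding is simple: the k-th byte of every float value is grouped into its own stream before compression, so bytes of equal significance sit together. A minimal FP32 round-trip sketch (this mirrors the idea in the spec PR, not any shipped implementation):

```python
import struct

def byte_stream_split(values):
    """Scatter the bytes of float32 values into 4 streams: all first bytes,
    then all second bytes, and so on. Grouping bytes of equal significance
    tends to help general-purpose compressors like gzip or zstd."""
    raw = b"".join(struct.pack("<f", v) for v in values)
    return b"".join(raw[i::4] for i in range(4))

def byte_stream_join(encoded, n):
    """Invert the split: re-interleave the 4 streams and unpack floats."""
    streams = [encoded[i * n:(i + 1) * n] for i in range(4)]
    raw = bytes(b for group in zip(*streams) for b in group)
    return [struct.unpack("<f", raw[j * 4:(j + 1) * 4])[0] for j in range(n)]

values = [1.0, 2.5, -3.25]  # all exactly representable in float32
print(byte_stream_join(byte_stream_split(values), len(values)) == values)  # True
```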


Re: Writing INT96 timestamp in parquet from either avro/protobuf records

2019-05-10 Thread Julien Le Dem
Hi Arup,
You are correct, you would have to use the lower level APIs or contribute
the int96 support to either protobuf or avro integrations.
However, we recommend that users migrate away from the int96 type, so I
would not recommend adding that support.
https://issues.apache.org/jira/browse/PARQUET-323
Maybe check how the tools you use to query that data interpret int96 and
int64, you might have a better solution moving to the new type and it being
compatible.

On Fri, May 3, 2019 at 11:34 AM Arup Malakar  wrote:

> Following up on the thread, my current understanding is that INT96 is not a
> native type in either of protobuf/avro, so the corresponding high level
> parquet writers don’t support that. But `INT96` is supported by low level
> parquet writer apis. I was able to generate parquet files with INT96 using
> examples from:
>
> https://stackoverflow.com/questions/54657496/how-to-write-timestamp-logical-type-int96-to-parquet-using-parquetwriter
>
> Arup
>
> On Wed, May 1, 2019 at 7:32 PM Arup Malakar  wrote:
>
> > Hi parquet-dev,
> >
> > We have existing parquet files which were generated from json using hive,
> > where timestamps live as INT96. We are changing the pipeline where we are
> > planning to use flink to generate parquet files from protobuf (or avro)
> > using flink's StreamingFileSink. But from my research I am unable to
> find a
> > way to write INT96 columns in the parquet either from avro or protobuf.
> We
> > would like to keep the same datatype on disk for historical and new data
> so
> > would like to stick to INT96, any suggestion how to achieve that?
> >
> > --
> > Arup Malakar
> >
>
>
> --
> Arup Malakar
>
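For anyone producing the bytes with the low-level writer, the legacy INT96 timestamp layout used by Hive/Impala is 12 bytes: nanoseconds within the day (8 bytes, little-endian) followed by the Julian day number (4 bytes, little-endian). A minimal sketch of the packing — keeping in mind the type is deprecated, as noted in the thread:

```python
import struct
from datetime import datetime, timezone

JULIAN_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01

def timestamp_to_int96(ts: datetime) -> bytes:
    """Pack a UTC timestamp into the legacy INT96 layout: 8 bytes of
    nanoseconds-within-day followed by a 4-byte Julian day, little-endian."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    delta = ts - epoch
    julian_day = JULIAN_UNIX_EPOCH + delta.days
    nanos = (delta.seconds * 1_000_000 + delta.microseconds) * 1_000
    return struct.pack("<qi", nanos, julian_day)

# One second past midnight on the day after the Unix epoch:
buf = timestamp_to_int96(datetime(1970, 1, 2, 0, 0, 1, tzinfo=timezone.utc))
print(struct.unpack("<qi", buf))  # (1000000000, 2440589)
```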


Re: Parquet Sync

2019-04-15 Thread Julien Le Dem
It would be fine to have a rotation.

On Mon, Apr 15, 2019 at 10:44 PM Lars Volker 
wrote:

> Hi,
>
> I'd be happy to help. I have organized a few of these in the past, and I've
> recently started similar meetings for the Impala project.
>
> If someone else wants to do it, that's fine for me, too, of course.
>
> Cheers, Lars
>
> On Mon, Apr 15, 2019, 22:14 Julien Le Dem  wrote:
>
> > Hello all,
> > Since I have been away with the new baby the Parquet syncs have fallen
> > behind.
> > I'd like a volunteer to run those.
> > Responsibilities include taking notes and posting them on the list.
> > Also occasionally finding a good time for the meeting.
> > Any takers? This could be a rotating duty as well.
> > Thank you
> > Julien
> >
>


Re: Parquet Sync

2019-04-15 Thread Julien Le Dem
No requirement to be a PMC member no.

On Mon, Apr 15, 2019 at 10:41 PM Xinli shang 
wrote:

> Is there any requirement like being PMC of Parquet?
>
> On Mon, Apr 15, 2019 at 10:14 PM Julien Le Dem 
> wrote:
>
> > Hello all,
> > Since I have been away with the new baby the Parquet syncs have fallen
> > behind.
> > I'd like a volunteer to run those.
> > Responsibilities include taking notes and posting them on the list.
> > Also occasionally finding a good time for the meeting.
> > Any takers? This could be a rotating duty as well.
> > Thank you
> > Julien
> >
> --
> Xinli Shang
>


Parquet Sync

2019-04-15 Thread Julien Le Dem
Hello all,
Since I have been away with the new baby the Parquet syncs have fallen
behind.
I'd like a volunteer to run those.
Responsibilities include taking notes and posting them on the list.
Also occasionally finding a good time for the meeting.
Any takers? This could be a rotating duty as well.
Thank you
Julien


[Draft REPORT] Apache Parquet - January 2019

2019-01-07 Thread Julien Le Dem
## Description:
Parquet is a standard and interoperable columnar file format
for efficient analytics. Parquet has 3 sub-projects:
- parquet-format: format reference doc along with thrift-based metadata
definition (used by both sub-projects below)
- parquet-mr: Java APIs and implementation of the format along with
integrations to various projects (thrift, pig, protobuf, avro, ...)
- parquet-cpp: C++ APIs and implementation of the format along with Python
bindings and Arrow integration.

## Issues:
 No issues at this time.

## Activity:
Current activity around:

   - encryption
   - Page indexing
   - cutting a new release
   - improvement on parquet-proto


## Health report:
The discussion volume on the mailing lists is stable.
Tickets get created and closed at a reasonable pace.

## PMC changes:

 - Currently 24 PMC members.
 - No new PMC members added in the last 3 months
 - Last PMC addition was Zoltan Ivanfi on Sun Apr 15 2018

## Committer base changes:

 - Currently 31 committers.
 - No new committers added in the last 3 months
 - Last committer addition was Benoit Hanotte at Mon May 28 2018

## Releases:

 - Last release was Format 2.6.0 on Mon Oct 01 2018

## Mailing list activity:

 - dev@parquet.apache.org:
- 216 subscribers (up 2 in the last 3 months):
- 529 emails sent to list (757 in previous quarter)


## JIRA activity:

 - 49 JIRA tickets created in the last 3 months
 - 65 JIRA tickets closed/resolved in the last 3 months


Re: [Discuss] Code of conduct

2018-12-11 Thread Julien Le Dem
Strangely enough, I was unaware of the Apache CoC, which has been around
for a while.
How about we add a CODE_OF_CONDUCT.md at the root of the repo pointing to
the apache CoC?
It seems to be the place people would look at first.

On Sun, Dec 9, 2018 at 8:54 PM Uwe L. Korn  wrote:

> Hello Julien,
>
> As per ASF guideline
> https://www.apache.org/foundation/policies/conduct.html applies also to
> the Apache Parquet channels. Would that be sufficient for you?
>
> Cheers
> Uwe
>
> On Sat, Dec 8, 2018, at 2:14 AM, Julien Le Dem wrote:
> > We currently don’t have an explicit code of conduct. We’ve always
> > encouraged respectful discussions and as far as I know all discussions
> have
> > been that way.
> > However, I don’t think we should wait for an incident to create the need
> > for an explicit code of conduct. I suggest we adopt the contributor
> > covenant as it is well aligned with our values as far as I am concerned.
> > I also think that explicitly adopting it will encourage others to do the
> > same in the open source community.
> > Best
> > Julien
>


[Discuss] Code of conduct

2018-12-07 Thread Julien Le Dem
We currently don’t have an explicit code of conduct. We’ve always
encouraged respectful discussions and as far as I know all discussions have
been that way.
However, I don’t think we should wait for an incident to create the need
for an explicit code of conduct. I suggest we adopt the contributor
covenant as it is well aligned with our values as far as I am concerned.
I also think that explicitly adopting it will encourage others to do the
same in the open source community.
Best
Julien


parquet-sync notes December 5 2018

2018-12-05 Thread Julien Le Dem
Deepak: encryption, column statistics

Zoltan: vote on the release

Nandor:

Ryan (netflix): release candidate, validation of the release, encryption

Gidon (IBM): update encryption

Lars (Cloudera Impala):

Qinghui (Criteo): PR in parquet-proto, next release.

Replace current proto compiler: maven-protoc plugin: more portable

Support enums in protobuf with backward compatibility

Steven (Yelp)

Protobuf:

   - 2 PRs:
      - More portable proto plugin
      - Enum support. PARQUET-1455, https://github.com/apache/parquet-mr/pull/561
         - Makes it consistent with protobuf behavior.
   - Actions:
      - Merge the PR with the new proto plugin.
      - Have a committer familiar with proto (Benoit) review #561


Encryption:

   - Mention of the order-preserving encryption. Can be used to compare
     encrypted statistics
   - Finalizing the spec
      - No more technical changes
      - Just clarifying the details
      - Goal to merge it before the end of the year.
   - Action:
      - Gidon to send an updated version in a few days
      - Will start a thread this week


Release candidate:

   - Need to vote on the release
   - What tests have been done?
      - Unit tests
      - Benchmark for perf
      - Cloudera internal integration tests.
         - Some updates to that integration pipeline due to parquet changes
           (proto, shaded avro)
         - Hive -> Hive, Hive -> Impala, Impala -> Hive tested
         - Ran Spark unit tests
      - Not run:
         - Spark benchmark
      - parquet-cpp lagging behind
   - Action:
      - Gabor, Zoltan: Produce a summary
      - Write a validator that verifies the contract of statistics (ex:
        all values greater than the min)
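The statistics contract mentioned in that last action item can be stated directly as code. A minimal sketch of such a validator, with statistics represented as a plain dict rather than the parquet-mr API:

```python
def validate_column_stats(values, stats):
    """Check the basic statistics contract for one column chunk: every value
    must lie within [min, max] and the null count must match. Returns a list
    of violations (empty means the stats are consistent)."""
    problems = []
    non_null = [v for v in values if v is not None]
    if stats["null_count"] != len(values) - len(non_null):
        problems.append("null_count mismatch")
    for v in non_null:
        if v < stats["min"]:
            problems.append(f"value {v} below declared min {stats['min']}")
        if v > stats["max"]:
            problems.append(f"value {v} above declared max {stats['max']}")
    return problems

# A chunk whose declared min is wrong is flagged:
print(validate_column_stats([3, None, 7], {"min": 4, "max": 7, "null_count": 1}))
# ['value 3 below declared min 4']
```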


Re: Parquet sync meeting notes

2018-11-06 Thread Julien Le Dem
- I reached out to Ryan who will get back on the PR
- I reached out to Jacques regarding page level stats
- also advertised it on twitter:
https://twitter.com/J_/status/1059860813115052032

On Tue, Nov 6, 2018 at 9:30 AM Julien Le Dem 
wrote:

> Attendees:
>
>- Gabor (Cloudera)
>- Nandor (Cloudera)
>- Zoltan (Cloudera): new parquet-mr release
>- Anna (Cloudera): new parquet-mr release. Would like encryption
>update
>- Gidon (IBM): status of encryption design sign off
>- Xinli (Uber): encryption
>- Steven (Yelp)
>- Julien (Wework)
>- Aniket (Google): cloud dataproc. Interest in bloom filter.
>
>
> Parquet-mr release:
>
>- Column indexes
>- Jira open: remove the page level statistics for:
>https://issues.apache.org/jira/browse/PARQUET-1365
>
>- Action: reach out about page-level stats.
>
>
> Encryption:
>
>- https://github.com/apache/parquet-format/pull/114
>- Work on c++ implementation and at Uber is blocked on this.
>
>
> Bloom Filter:
>
>- Will reach out on the mailing list
>
>
> Meeting time:
>
>-  Will start a new vote.
>
>
>


Parquet sync meeting notes

2018-11-06 Thread Julien Le Dem
Attendees:

   - Gabor (Cloudera)
   - Nandor (Cloudera)
   - Zoltan (Cloudera): new parquet-mr release
   - Anna (Cloudera): new parquet-mr release. Would like encryption update
   - Gidon (IBM): status of encryption design sign off
   - Xinli (Uber): encryption
   - Steven (Yelp)
   - Julien (Wework)
   - Aniket (Google): cloud dataproc. Interest in bloom filter.


Parquet-mr release:

   - Column indexes
   - Jira open: remove the page level statistics for:
   https://issues.apache.org/jira/browse/PARQUET-1365
   

   - Action: reach out about page-level stats.


Encryption:

   - https://github.com/apache/parquet-format/pull/114
   - Work on c++ implementation and at Uber is blocked on this.


Bloom Filter:

   - Will reach out on the mailing list


Meeting time:

   -  Will start a new vote.


Re: How to reduce the "committed time" for contributions

2018-10-17 Thread Julien Le Dem
Thanks for starting this discussion Anna.
I agree we need to improve and should try your suggestions
What do others think?

On Tue, Oct 16, 2018 at 11:46 Anna Szonyi 
wrote:

> Hi All,
>
> I wanted to follow up on the discussion we had on the weekly sync
> meeting, namely: how can we reduce the "time to committed" for a
> contribution without compromising quality.
>
> A few of the ideas we were talking about on the meeting (and some I've
> seen work on other projects):
>
> *For contributors:*
>
>  - Incentivize newer contributors to cross-review each other's PRs, so
> the review burden is reduced on the committers
>  - Utilize feature branches more consistently as master-proxies, so
> the reviews get smaller, more incremental and should thus reduce the
> overall complexity of the review for large features, as proposed by
> Julien
>  - Discuss community best practices wrt PRs that are waiting for
> feedback: like waiting periods, pinging people, calling interactive
> review meetings, or anything related.
>
> *For committers/reviewers:*
>
>  - From a reviewer's perspective we could also discuss some best
> practices or etiquette rules of thumb: e.g. in my previous project we
> had other committers start timeouts on blocks, reviewers pinging other
> reviewers if you were going to be unavailable or we would propose some
> alternative methods for resolving issues (interactive/live
> reviews/discussions)
>
>  - We should also make it explicit if a design doc is needed for a
> particular feature and that the review on that doc is a blocker for
> the rest of the code reviews, so we don't end up creating PRs before
> the approach is vetted
>
>
> I wanted to raise this, so we can leverage the collective experience
> (with processes/best practices) of the community and maybe discuss the
> useful aspects on the next sync meeting.
>
> Best,
> Anna
>


Re: [RESULT] [VOTE] Moving Apache Parquet C++ development process to a monorepo structure with Apache Arrow C++

2018-10-15 Thread Julien Le Dem
What does archiving the master branch look like? Are we renaming master and
leaving a readme pointing to the new repo?


On Thu, Sep 20, 2018 at 3:30 PM Wes McKinney  wrote:

> OK. There is still some code (examples, CLI tools) that needs to be
> moved over. Once that's done and all the outstanding PRs are
> moved/closed, I will do that
> On Thu, Sep 20, 2018 at 8:45 AM Uwe L. Korn  wrote:
> >
> > Hello Wes,
> >
> > I'm definitely +1 on archiving the master branch. I'm not sure what you
> mean exactly with this. I would have simply added a final commit that
> deletes all code and adds a message to the README that the repository has
> moved into a another repo.
> >
> > Cheers
> > Uwe
> >
> > On Thu, Sep 13, 2018, at 10:47 PM, Wes McKinney wrote:
> > > hi folks,
> > >
> > > Could I get some feedback about the follow-up items? There are still
> > > some parts of the codebase that need to be migrated. Additionally, I'm
> > > proposing to archive the master branch so that people with build
> > > toolchains running against parquet-cpp master will be forced to
> > > migrate. The hard part is over; I would like to get things closed out
> > > on apache/parquet-cpp and move development forward.
> > >
> > > Thanks,
> > > Wes
> > > On Sun, Sep 9, 2018 at 8:45 PM Wes McKinney 
> wrote:
> > > >
> > > > Might make sense to archive the master branch so that people's
> > > > now-outdated build toolchains (where they may be cloning
> > > > apache/parquet-cpp) will fail fast. We are already starting to get
> bug
> > > > reports along these lines.
> > > >
> > > > Thoughts?
> > > > On Sat, Sep 8, 2018 at 10:43 AM Wes McKinney 
> wrote:
> > > > >
> > > > > We should probably also write a blog post on the Apache Arrow
> website
> > > > > to increase visibility of this move to the broader community.
> > > > >
> > > > > On Sat, Sep 8, 2018 at 10:42 AM Wes McKinney 
> wrote:
> > > > > >
> > > > > > Dear all -- the merge has been completed, thank you! 318 patches
> > > > > > (after the filter-branch grafting procedure) were merged to
> > > > > > apache/arrow
> > > > > >
> > > > > > We have some follow up work to do:
> > > > > >
> > > > > > * Move patches from apache/parquet-cpp to apache/arrow
> > > > > > * Add CONTRIBUTING.md and note to README that patches are no
> longer
> > > > > > accepted at the old location
> > > > > > * Migrate CLI utilities and other small items that did not
> survive the
> > > > > > merge: tools/, benchmarks/, and examples/
> > > > > > * Develop new release procedure for Apache Parquet
> > > > > >
> > > > > > On this third point, we can also import their git history if
> desired.
> > > > > > Incorporating them into the build will be comparatively easy to
> the
> > > > > > library integration.
> > > > > >
> > > > > > There are already some JIRA issues open for some of these, but
> > > > > > anything else please create issues so we can keep track.
> > > > > >
> > > > > > I'm already quite excited to get busy with some refactoring and
> > > > > > internals improvements that I had avoided because of the painful
> > > > > > development procedure.
> > > > > >
> > > > > > Thanks,
> > > > > > Wes
>


parquet sync notes

2018-10-09 Thread Julien Le Dem
Gabor (Cloudera): column index, benchmark, nested types (filter, indexes)
Anna (Cloudera): process, feature branches, etiquette of waiting for
someone? Blocked
Zoltan (Cloudera): Feature branches? When to review them?
Nandor (Cloudera): parquet file with multiple row groups, schema evolution
Zoltan (Cloudera): column index
Junjie (tencent): listening
Gidon (IBM): encryption next steps
Jim: bloom filter, Bit weaving
Xinli (Uber): encryption
Julien (WeWork): encryption

Bloom filter:

   - PR for doc. Parquet-format feature branch.
      - To be reviewed by: Deepak, Jim, Ryan.

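For context on what the feature buys readers: a column chunk's bloom filter answers "definitely absent" or "maybe present" for a predicate value, so a reader can skip a row group entirely on a definite miss. A generic sketch of the idea (the design discussed for Parquet is a split-block filter; this toy version uses a plain k-hash bit array with SHA-256 for brevity):

```python
import hashlib

class TinyBloomFilter:
    """Minimal bloom filter sketch: k hash positions per value over one
    bit array. Not the split-block layout proposed for Parquet."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as an arbitrarily wide bit array

    def _positions(self, value):
        # Derive k deterministic bit positions from the value.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.num_bits

    def insert(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means definitely absent; True means maybe present.
        return all(self.bits >> pos & 1 for pos in self._positions(value))

bf = TinyBloomFilter()
bf.insert("alice")
print(bf.might_contain("alice"))  # True
```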

Encryption:

   - Another encryption effort exists, Julien to send intros: Xinli,
   Gidon, Zoltan
   - New requirements, updated doc, implement code changes.


Process:

   - Feature branches:
      - Julien to follow up with Ryan
      - Feature branches are considered like master:
         - Every change is reviewed individually through a PR
         - Every change has a jira
         - Only difference is that it's ok to make incompatible changes
   - Squash merge vs merge commit:
      - Merge commit keeps the history but clutters
      - 3 options:
         - Merge commit
            - Clutters history (not linear anymore)
            - But if each commit in the branch has a jira it seems fine
         - Squash
            - Loses the detailed commits of the feature
            - Keeps history linear
         - Rebase feature branch
            - Keeps history linear and keeps history
            - But need to address conflicts for each commit in the branch
            - Commits in the branch are now disconnected from the PR
              (modified after the fact).
   - When is it appropriate to wait:
      - Balance:
         - Making sure we don't make incompatible changes to the format and
           we have final features
         - Making it easier for people to contribute.
      - Anna to start a conversation around our etiquette
         - How long is it appropriate to wait on feedback
         - How to know who's the best committer to drive a PR to conclusion


Filtering nested types support:

   -  We should store stats for nested types


Page Index benchmark:

   - Nice results, comparing random to sorted files:
      - https://jmh.morethan.io/?gist=2388d962d6380f74a78ad0d97b4353a2/benchmarkWithOrWithoutColumnIndex.json
      - https://jmh.morethan.io/?gist=2388d962d6380f74a78ad0d97b4353a2/benchmarkPageSize.json
   - Need to compare the page size's effect on compression and file size


Appending to a parquet file:

   -  The type of a column chunk should be consistent with the schema in
   the footer.


Re: BitWeaving in Parquet?

2018-10-08 Thread Julien Le Dem
If you want (and if you don't already know him) I'm happy to ask Jignesh if
he wants an intro.
I think he would be happy to tell you about it.

On Mon, Oct 8, 2018 at 4:04 PM Jim Apple  wrote:

> > That sounds like an interesting possibility. It's not that fresh in my
> mind
> > but I'd say from the storage perspective it's a variation of bit packing.
> > right?
>
> I'm not familiar with bit packing, so I'd have to look into that. I found
> the paper readable enough at the time that I didn't end up doing a lot of
> groundwork reading to understand the origins.
>
> > We would need an implementation of a runtime for this to make sense, so I
> > suppose that the impala team is looking into implementing this?
>
> The Impala community hasn't been discussing this, as far as I am aware. I
> came across it in another paper and thought it might be of interest to
> consumers of Parquet, including Impala, but this is the first place I'm
> shopping the idea around.
>


Re: BitWeaving in Parquet?

2018-10-08 Thread Julien Le Dem
Hi Jim,
I remember chatting with Jignesh Patel about it at the time.
Since his company Locomatix was acquired by Twitter, we had him as an
adviser for some time.
That sounds like an interesting possibility. It's not that fresh in my mind
but I'd say from the storage perspective it's a variation of bit packing.
right?
We would need an implementation of a runtime for this to make sense, so I
suppose that the impala team is looking into implementing this?
It would be interesting to have this type of "compressed" vector in Arrow
too. But I don't know if you're looking into Arrow on your end.
Cheers,
Julien




On Mon, Oct 8, 2018 at 2:53 PM Jim Apple  wrote:

> The BitWeaving paper from a few years ago demonstrates some large
> performance wins in predicate evaluation based partially on reconfiguring
> the storage layout:
>
> http://pages.cs.wisc.edu/~jignesh/publ/BitWeaving.pdf
>
> Is it technically possible for Parquet to support "Vertical Bit-Parallel"
> layout as an option?
>

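The appeal of the vertical bit-parallel layout is that a predicate touches one machine word per bit position instead of one comparison per value. A toy sketch of bit-plane storage plus an equality predicate (Python integers stand in for machine words here; the paper's actual layout segments codes across processor words):

```python
def to_bit_planes(codes, width):
    """Transpose a list of small integer codes into `width` bit-planes:
    plane b holds bit b of every code, packed one value per bit position."""
    planes = [0] * width
    for i, code in enumerate(codes):
        for b in range(width):
            if code >> b & 1:
                planes[b] |= 1 << i
    return planes

def equals(planes, n, constant):
    """Evaluate `code == constant` for all n values at once, one bitwise
    op per plane; returns a result bitmap with bit i set on a match."""
    mask = (1 << n) - 1
    result = mask
    for b, plane in enumerate(planes):
        expected = mask if constant >> b & 1 else 0
        result &= ~(plane ^ expected) & mask
    return result

codes = [5, 3, 5, 1]                    # 3-bit codes
planes = to_bit_planes(codes, 3)
matches = equals(planes, len(codes), 5)
print([i for i in range(len(codes)) if matches >> i & 1])  # [0, 2]
```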

parquet sync notes

2018-09-25 Thread Julien Le Dem
Lars (Cloudera Impala): listen in.
Zoltan, Gabor and Nandor (Cloudera):

   - feature branch reviewed and merged
   - Parquet-format release
      - Define scope

Ryan (Netflix)
Junjie (tencent): bloom filter
Jim Apple (cloud service): bloom filter in parquet-mr? Since they got into
parquet-cpp
Gidon (IBM): encrytpion
Sahil (Cloudera impala, hive): listen in
Julien (Wework)

Status update from Gabor:

   - Waiting for reviews.
      - Plan to merge this Friday.
      - Please review in the next few days.

Parquet format release:

   - Nanosecond precision
   - Deprecation of Java-related code
   - Encryption metadata
      - One more PR to merge
   - Plan:
      - Revert the encryption patches and put them in a feature branch in
        parquet-format
      - Apply the same process to bloom filters
      - Owner of a PR can update it to the feature branch


Encryption:

   - Old readers can read non-encrypted columns
      - Changes to metadata
      - One last PR on parquet-format
      - We should have a vote before merging it.
   - Make sure parquet-cpp depends on the source-of-truth thrift in
     parquet-format.


Bloom filter:

   - parquet-format/62 and parquet-format/99
   - parquet-format/28: should be closed as it is outdated. We should port
     the doc to the more recent PR.
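As background for the bloom-filter items above, the basic structure under discussion can be sketched as follows. This is a generic textbook Bloom filter, not the layout or hash function defined by the PARQUET-41 work; the class and its parameters are illustrative:

```python
# Minimal Bloom filter sketch (illustrative only; the Parquet spec work
# under PARQUET-41 defines its own filter layout and hash function).
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive `num_hashes` bit positions from one digest of the value.
        digest = hashlib.sha256(value.encode()).digest()
        for k in range(self.num_hashes):
            chunk = digest[4 * k:4 * k + 4]
            yield int.from_bytes(chunk, "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # No false negatives; false positives possible.
        return all(self.bits >> pos & 1 for pos in self._positions(value))

bf = BloomFilter()
for v in ["alice", "bob"]:
    bf.add(v)
print(bf.might_contain("alice"))  # True
print(bf.might_contain("carol"))  # False with high probability
```

The trade-off debated in the sync (bloom filter vs. dictionary) comes down to distinct-value count: once a column has too many distinct values for a dictionary page, a filter like this still answers membership probes cheaply.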


Re: Date and time for next Parquet sync

2018-08-28 Thread Julien Le Dem
Notes:
Anna (Cloudera): Bloom filter update, Iceberg
Gabor, Nandor (Cloudera):

   - Value skipping implementation to be reviewed. Move Java code from
   parquet-format to parquet-mr. PR ready
   - How can users of Parquet handle timestamps and TZs? Allow for writing
     timestamps in Java. Refactor the original-type logic to a more
     flexible new original-type API.
   - Column indexes and alignment of pages
   - Limiting the number of records in a page to avoid skewed splits when
   compression is really good.

Ryan (Netflix): Iceberg stuff back to Parquet: expression library for push
down. Dictionary and stats based row group filtering.
JunJie (Intel): Bloom filter. Need more reviews. Have a vote on the design
and add it to parquet-format.
Julien (Wework): Encryption.


   - Bloom Filter:
     https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-41
      - Committed utility class to parquet-cpp
      - Uploaded the benchmark result.
      - Ready to add into the spec.
      - Submit a PR for the parquet reader spec.
      - *Action*: review the parquet Java utility class.
        https://github.com/apache/parquet-mr/pull/425
   - Encryption:
      - Nandor, Gabor reviewing.
      - APIs to allow pluggable key management.
      - Need to have a proper review of the spec.
      - Need more testing.
   - Column indices:
      - PR to be reviewed: https://github.com/apache/parquet-mr/pull/514
      - Ryan: to review feature branch
   - Moving Java code from parquet-format to parquet-mr:
      - Action: review https://github.com/apache/parquet-mr/pull/517
      - Gets the thrift file from the parquet-format released artifact.
   - Maximum number of records per page:
      - We should add a property with a maximum number of records per page
        and per row group.
      - Need to benchmark to figure out a good default. 10K?
   - Iceberg:
      - Some of the Iceberg code should be in Parquet:
         - Rewrote record reconstruction stack
            - Reuses page reader and decoder
            - Then does a triple iterator that returns an entire column in
              a file (iterator of triples)
            - Record reconstruction class that handles everything that the
              current one does but with {list, map} factories
               - 20% faster to write, 5% faster to read
               - Easier to write object mappers
            - Helps with page-level skipping.
         - High-level abstractions in the Iceberg library:
            - Take an expression and simplify it (not, ...) to run on
              metadata
            - Take a complex expression and split the part on the
              partition/min/max and the remaining part.
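The "simplify it (not, ...)" step mentioned in the notes is essentially pushing NOT through AND/OR via De Morgan's laws and negating leaf comparisons, so that only positive predicates need to be evaluated against metadata. A toy sketch (the tuple representation and names here are my own, not Iceberg's actual expression classes):

```python
# Toy "push NOT down" simplification (not Iceberg's real expression
# library): expressions are tuples like ("and", l, r), ("or", l, r),
# ("not", e), or a leaf comparison ("lt", column, literal).

NEGATED = {"lt": "ge", "ge": "lt", "eq": "ne", "ne": "eq"}

def push_not_down(expr, negate=False):
    op = expr[0]
    if op == "not":
        # A NOT just flips the polarity we carry downward.
        return push_not_down(expr[1], not negate)
    if op in ("and", "or"):
        # De Morgan: NOT(a AND b) == NOT a OR NOT b, and vice versa.
        flipped = {"and": "or", "or": "and"}[op] if negate else op
        return (flipped,
                push_not_down(expr[1], negate),
                push_not_down(expr[2], negate))
    # Leaf comparison: negate the operator if needed.
    cmp_op, col, lit = expr
    return (NEGATED[cmp_op] if negate else cmp_op, col, lit)

e = ("not", ("and", ("lt", "x", 5), ("not", ("eq", "y", 1))))
print(push_not_down(e))  # ('or', ('ge', 'x', 5), ('eq', 'y', 1))
```

With NOT eliminated, each remaining leaf can be tested directly against partition values or column min/max statistics.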






On Mon, Aug 27, 2018 at 4:56 AM, Nandor Kollar  wrote:

> Yes, CEST.
>
> On Mon, Aug 27, 2018 at 1:01 PM, Uwe L. Korn  wrote:
> > Hello Nador,
> >
> > probably I can make this time. Just a timezone question: Is it 6pm CET
> or 6pm CEST? I guess the latter.
> >
> > See http://timesched.pocoo.org/?date=2018-08-28=central-
> europe-standard-time!,pacific-standard-time=1080,1140
> >
> > Uwe
> >
> > On Mon, Aug 27, 2018, at 12:20 PM, Nandor Kollar wrote:
> >> Hi All,
> >>
> >> As discussed on last Parquet sync, I propose to have an other meeting
> >> on August 28th, at 6pm CET / 9 am PST to discuss those topic which we
> >> didn't have time on the sync at August 15th, and of course any new
> >> topic too.
> >>
> >> Sorry for the late notice, feel free to propose other time slot if is
> >> is not suitable for you! Calendar entry to follow.
> >>
> >> Regards,
> >> Nandor
>


Parquet sync notes

2018-06-12 Thread Julien Le Dem
 QingHui (Criteo): parquet-protobuf
Lars (impala), Jim (Cloudera): Bloom filter benchmarks
Ryan (Netflix):
JunJie (Intel): Bloom filter and dictionary comparison benchmarks
Gidon (IBM): Encryption, feedback
Xinli Shang (Uber): Encryption

Bloom filter and dictionary comparison benchmarks:

   - PARQUET-41
   - Feedback: find the number of distinct values for which the bloom
     filter outperforms filter-based search
   - Action: JunJie to share code and update benchmark

Encryption:

   - Progress on multi-key design: need review
   - Need review on PR as well
   - Discussion on how to pass parameters down to Parquet to specify what
   to encrypt
   - Action: Gidon to share PR again and others to review.


Parquet sync notes

2018-06-07 Thread Julien Le Dem
Attendees / Agenda:
Gidon (IBM): Parquet encryption. Uber, Vertica, Amazon
Anna, Gabor, Nandor (Cloudera): Review for column indexing
Junjie (tencent): Bloom filter
Lars (Cloudera impala)
Jim (Cloudera): Bloom filter
Deepak (Vertica): Encryption
Qinghui, Benoit (Criteo): parquet protobuf.

Parquet encryption:
* Deepak will look at the code this week.
* Gidon update:
* multi key encryption (one for keys and one for footer)
* Implementation available.
* Working on performance evaluation
* Starting in Java 9, encryption is hardware accelerated and much
faster. Very little overhead.
* Java 8 encryption has more overhead.
* If using gzip overhead is small
* If using snappy, overhead is high
* Added a second encryption implementation that is faster but less
secure for java 8
* Advantage of 2 algorithms: makes us think about formalizing the
algorithm choice in the metadata as well.
* Use case to use encryption without api. Through Hadoop config to pass
info.
* Modified design document
* Discussion on metadata.
* Column indexes do not replace the statistics in the footer but replace
the statistics in the page header.
Column indexing:
* Parquet-mr/pr/481

* Encryption
* [Some things covered already before these notes started]
* Hardware support for encryption? Yes on POWER. Not sure about ARM.
Definitely x86-64.
* Bloom filters: C++ needs review, but also doing performance tests
* Guava Bloom filter: Not sure if compat between version. Impala BFs
might be much faster
* Java vs. C++ compat: there will be tests
* Column indexing
* parquet-mr 481 https://github.com/apache/parquet-mr/pull/481
* Right now doing in a separate branch for compat reasons. Not sure the
write path will work.
* That branch has 3 or more commits
* Column indexes will be stored just before the filter. Will the
statistics (before the footer) still be useful with column indexing? Can
we just leave them out?
* Filter is for row-groups, column indexing is for pages?
* Do we store the maximum value in a page, or a value that is
greater than or equal to the largest value in the page? Impala does the
latter; PR#481 does that for some pages, but not all (?)
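The "value greater than or equal to the largest value in the page" idea discussed above can be illustrated with a small truncation helper: instead of storing a long exact max, store a short prefix bumped up so it still bounds every value. This is a sketch of the general technique, not the truncation rule that PR#481 or Impala actually implements:

```python
# Sketch of storing an upper *bound* instead of the exact max value:
# truncate the true max and bump the last byte so the short value still
# compares >= every value in the page. Illustrative only.

def upper_bound(max_value: bytes, limit: int) -> bytes:
    """Return a value of at most `limit` bytes that is >= max_value."""
    if len(max_value) <= limit:
        return max_value
    prefix = bytearray(max_value[:limit])
    # Bump the last byte that is not already 0xFF; drop anything after it.
    for i in reversed(range(len(prefix))):
        if prefix[i] != 0xFF:
            prefix[i] += 1
            return bytes(prefix[:i + 1])
    raise ValueError("cannot shorten: prefix is all 0xFF bytes")

print(upper_bound(b"strawberry", 4))  # b'strb', which sorts >= b'strawberry'
```

The same trick applies symmetrically to the min side (a truncated prefix is already a lower bound), which is why a writer can keep page-level index entries small for long string columns.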


[Announce] new Parquet committer Benoit Hanotte

2018-05-29 Thread Julien Le Dem
We are happy to announce that Benoit has accepted to become a Parquet
committer.
Welcome Benoit!


Re: Permissions for committers

2018-05-22 Thread Julien Le Dem
You don’t push commits to GitHub. You push them to the Apache git and they
get replicated to GitHub.

On Tue, May 22, 2018 at 09:37 Julien Le Dem <julien.le...@wework.com> wrote:

> Do you have your GitHub id configured in id.apache.org?
>
> On Tue, May 22, 2018 at 06:18 Gabor Szadovszky <ga...@apache.org> wrote:
>
>> Hi,
>>
>> Could someone help me to have the required permissions on github so I can
>> push commits?
>>
>> Thanks a lot,
>> Gabor
>>
>


Re: Permissions for committers

2018-05-22 Thread Julien Le Dem
Do you have your GitHub id configured in id.apache.org?

On Tue, May 22, 2018 at 06:18 Gabor Szadovszky  wrote:

> Hi,
>
> Could someone help me to have the required permissions on github so I can
> push commits?
>
> Thanks a lot,
> Gabor
>


[Announce] new Parquet committer Constantin Muraru

2018-05-21 Thread Julien Le Dem
We are happy to announce that Constantin has accepted to become a Parquet
committer.
Welcome Constantin!


Re: Parquet Data Help

2018-05-21 Thread Julien Le Dem
This sounds like a hive question rather than a parquet question.
Did you try posting on the hive mailing list?

On Mon, May 21, 2018 at 12:59 AM, Shubham gurav 
wrote:

> Hey Dev,
>
> Currently using Hive 0.13 and our database is in parquet format. When i
> extract the data the output contains unicode characters like thorn
> delimiters -  þ or replacement characters (Unicode characters).
>
> So do we have to migrate to the latest version or Hive 0.13.1 supports
> parquet data. If yes, how is the syntax to be used and detailed guidelines
> on it.
>
> I have tried using delimiter with \-61 but still it has failed to get the
> data in correct format.
>
> Any help would  be highly appreciated.
>
> Regards,
> Shubham
>


notes parquet sync May 9 2018

2018-05-10 Thread Julien Le Dem
Attendees and agenda building

Deepak (vertica) : encryption cpp code
Jim (Cloudera Palo Alto)
Lars (Cloudera, impala)
Nandor, Zoltan, Anna (Cloudera Budapest):

   -  Breaking changes: avoid backwards incompatible changes

Benoit (Criteo)
Ryan (netflix)
Julien (WeWork)

Notes:

   - encryption cpp code: Deepak will sync with Gideon


Handling breaking changes and backward compatibility

   - Discussion of using annotations vs separate packages of APIs vs
     implementation to define Parquet’s official public API.
      - We agreed to discuss in PRs over concrete examples.
   - Existing problem:
      - convert method in ThriftSchemaConverter
      - PARQUET-405
      - PARQUET-287 => backward-incompatible changes
      - Anna to provide a code example that has the problem.


[jira] [Resolved] (PARQUET-968) Add Hive/Presto support in ProtoParquet

2018-04-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-968.
---
   Resolution: Fixed
Fix Version/s: 1.11

> Add Hive/Presto support in ProtoParquet
> ---
>
> Key: PARQUET-968
> URL: https://issues.apache.org/jira/browse/PARQUET-968
> Project: Parquet
>  Issue Type: Task
>Reporter: Constantin Muraru
>Assignee: Constantin Muraru
>Priority: Major
> Fix For: 1.11
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-968) Add Hive/Presto support in ProtoParquet

2018-04-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-968:
-

Assignee: Constantin Muraru  (was: Julien Le Dem)

> Add Hive/Presto support in ProtoParquet
> ---
>
> Key: PARQUET-968
> URL: https://issues.apache.org/jira/browse/PARQUET-968
> Project: Parquet
>  Issue Type: Task
>Reporter: Constantin Muraru
>Assignee: Constantin Muraru
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet

2018-04-26 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453972#comment-16453972
 ] 

Julien Le Dem commented on PARQUET-968:
---

merged in 
https://github.com/apache/parquet-mr/commit/f84938441be49c665595c936ac631c3e5f171bf9

> Add Hive/Presto support in ProtoParquet
> ---
>
> Key: PARQUET-968
> URL: https://issues.apache.org/jira/browse/PARQUET-968
> Project: Parquet
>  Issue Type: Task
>Reporter: Constantin Muraru
>Assignee: Constantin Muraru
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-968) Add Hive/Presto support in ProtoParquet

2018-04-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-968:
-

Assignee: Julien Le Dem

> Add Hive/Presto support in ProtoParquet
> ---
>
> Key: PARQUET-968
> URL: https://issues.apache.org/jira/browse/PARQUET-968
> Project: Parquet
>  Issue Type: Task
>Reporter: Constantin Muraru
>        Assignee: Julien Le Dem
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1281) Jackson dependency

2018-04-24 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450163#comment-16450163
 ] 

Julien Le Dem commented on PARQUET-1281:


parquet-hadoop should have its build include shading like parquet thrift:

https://github.com/apache/parquet-mr/blob/master/parquet-thrift/pom.xml#L174

> Jackson dependency
> --
>
> Key: PARQUET-1281
> URL: https://issues.apache.org/jira/browse/PARQUET-1281
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Qinghui Xu
>Priority: Major
>
> Currently we shaded jackson in parquet-jackson module (org.codehaus.jackson 
> --> shaded.parquet.org.codehaus.jackson), but in fact we do not use the 
> shaded jackson in parquet-hadoop code. Is that a mistake? (see 
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ParquetMetadata.java#L26)
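For context, the maven-shade-plugin relocation stanza being referenced looks roughly like this. This is an illustrative fragment of the kind of configuration parquet-hadoop would need, not the exact contents of the parquet-thrift pom:

```xml
<!-- Illustrative shade-plugin relocation (not the exact pom content):
     rewrites org.codehaus.jackson references in the shaded jar. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.codehaus.jackson</pattern>
            <shadedPattern>shaded.parquet.org.codehaus.jackson</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Relocation rewrites both the bundled classes and the bytecode that references them, which is why source code that imports the unshaded package name can still end up using the shaded copy at runtime.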



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Parquet sync

2018-04-24 Thread Julien Le Dem
Happening now:
https://meet.google.com/esu-yiit-mun


Re: Date and time for the next Parquet sync

2018-04-20 Thread Julien Le Dem
+1

On Wed, Apr 18, 2018 at 9:23 AM, Zoltan Ivanfi  wrote:

> +1, thanks Lars!
>
> On Wed, Apr 18, 2018 at 6:20 PM Lars Volker  wrote:
>
> > Hi All,
> >
> > It has been 3 weeks since our last Parquet community sync and I think it
> > would be great to have one next week. Last time we met on a Wednesday, so
> > this time it should be Tuesday.
> >
> > I'd like to propose next Tuesday, April 24th, at 6pm CET / 9 am PST.
> >
> > Please speak up if that time does not work for you.
> >
> > Cheers, Lars
> >
>


Re: [VOTE] Release Apache Parquet Format 2.5.0 RC0

2018-04-12 Thread Julien Le Dem
+1 (binding)
checked signature
ran build and tests

On Mon, Apr 9, 2018 at 8:44 AM, Ryan Blue  wrote:

> +1 (binding)
>
> Checked this for the last vote.
>
> On Mon, Apr 9, 2018 at 4:53 AM, Gabor Szadovszky <
> gabor.szadovs...@cloudera.com> wrote:
>
> > Hi everyone,
> >
> > Unfortunately, the previous vote has failed due to timeout. Now, Zoltan
> > and I propose a new vote for the same RC to be released as official
> Apache
> > Parquet Format 2.5.0 release.
> >
> > The commit id is f0fa7c14a4699581b41d8ba9aff1512663cc0fb4
> > * This corresponds to the tag: apache-parquet-format-2.5.0
> > * https://github.com/apache/parquet-format/tree/f0fa7c14a4699581b41d8ba9aff1512663cc0fb4
> >
> > The release tarball, signature, and checksums are here:
> > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.5.0-rc0/
> >
> > You can find the KEYS file here:
> > * https://dist.apache.org/repos/dist/dev/parquet/KEYS
> >
> > Binary artifacts are staged in Nexus here:
> > * https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/
> >
> > This release includes important changes that I should have summarized
> > here, but I'm lazy.
> > See https://github.com/apache/parquet-format/blob/f0fa7c14a4699581b41d8ba9aff1512663cc0fb4/CHANGES.md for details.
> >
> > Please download, verify, and test.
> >
> > [ ] +1 Release this as Apache Parquet Format 2.5.0
> > [ ] +0
> > [ ] -1 Do not release this because…
>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [RESULT][VOTE] Release Apache Parquet Format 2.5.0 RC0

2018-04-12 Thread Julien Le Dem
the release verification script for parquet-cpp is a good reference:
https://github.com/apache/parquet-cpp/blob/master/dev/release/verify-release-candidate

On Fri, Apr 6, 2018 at 8:57 AM, Ryan Blue  wrote:

> Yeah, I thought it was a hard limit when I wrote that. Then I had some
> votes fail and found out it's a bit more flexible. Sorry for the confusion.
>
> On Fri, Apr 6, 2018 at 8:50 AM, Zoltan Ivanfi  wrote:
>
> > Hi,
> >
> > We didn't know that's an option. The guide at
> > https://parquet.apache.org/documentation/how-to-release/ specifies 72
> > hours, we were under the
> > impression that that's a hard limit.
> >
> > Zoltan
> >
> > On Fri, Apr 6, 2018 at 5:43 PM Ryan Blue 
> > wrote:
> >
> >> Usually, a 3 day window is tough. I usually leave votes open-ended so
> >> people can vote as they have time. I think that if we had left this one
> >> open and pinged a couple of PMC members, it probably would have passed.
> >> Maybe try that the next time.
> >>
> >> rb
> >>
> >> On Fri, Apr 6, 2018 at 1:58 AM, Gabor Szadovszky <
> >> gabor.szadovs...@cloudera.com> wrote:
> >>
> >> > Hi All,
> >> >
> >> > For the vote for this parquet-format release have seen
> >> > 1   "+1" votes
> >> > 0   "0" votes
> >> > 0   "-1" votes
> >> >
> >> > Due to less than 3 binding votes this vote has FAILED.
> >> >
> >> > We will raise the topic of the parquet-format release on the next
> >> parquet
> >> > sync and will start a new vote after it if everyone agrees.
> >> >
> >> > Regards,
> >> > Gabor
> >> >
> >> > > On 4 Apr 2018, at 20:22, Ryan Blue 
> wrote:
> >> > >
> >> > > +1 (binding)
> >> > >
> >> > > Built & tested, validated checksums and signature. RAT results look
> >> fine.
> >> > >
> >> > > On Tue, Apr 3, 2018 at 2:57 AM, Gabor Szadovszky <
> >> > > gabor.szadovs...@cloudera.com> wrote:
> >> > >
> >> > >> Hi everyone,
> >> > >>
> >> > >> Zoltan and I propose the following RC to be released as official
> >> Apache
> >> > >> Parquet Format 2.5.0 release.
> >> > >>
> >> > >> The commit id is f0fa7c14a4699581b41d8ba9aff1512663cc0fb4
> >> > >> * This corresponds to the tag: apache-parquet-format-2.5.0
> >> > >> * https://github.com/apache/parquet-format/tree/f0fa7c14a4699581b41d8ba9aff1512663cc0fb4
> >> > >>
> >> > >> The release tarball, signature, and checksums are here:
> >> > >> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.5.0-rc0/
> >> > >>
> >> > >> You can find the KEYS file here:
> >> > >> * https://dist.apache.org/repos/dist/dev/parquet/KEYS
> >> > >>
> >> > >> Binary artifacts are staged in Nexus here:
> >> > >> * https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/
> >> > >>
> >> > >> This release includes important changes that I should have
> summarized
> >> > >> here, but I'm lazy.
> >> > >> See https://github.com/apache/parquet-format/blob/f0fa7c14a4699581b41d8ba9aff1512663cc0fb4/CHANGES.md for details.
> >> > >>
> >> > >> Please download, verify, and test.
> >> > >>
> >> > >> Please vote by 10AM on Friday, April 6, 2018 (UTC).
> >> > >>
> >> > >> [ ] +1 Release this as Apache Parquet Format 2.5.0
> >> > >> [ ] +0
> >> > >> [ ] -1 Do not release this because…
> >> > >>
> >> > >>
> >> > >
> >> > >
> >> > > --
> >> > > Ryan Blue
> >> > > Software Engineer
> >> > > Netflix
> >> >
> >> >
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >>
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: parquet-mr next release with PARQUET-1217?

2018-04-12 Thread Julien Le Dem
If someone wants a 1.9.1 it can be done, we'll need someone to own the
release process though.

On Tue, Apr 10, 2018 at 3:53 PM, Henry Robinson  wrote:

> Thanks! Sorry to miss the vote - was AFK for a few days. I look forward to
> testing it out anyhow.
>
> On 5 April 2018 at 14:28, Ryan Blue  wrote:
>
> > I just sent a vote for this. Took longer than expected because I had to
> > fix all of the javadoc warnings for java 8. Please test it out and vote.
> >
> > On Fri, Mar 30, 2018 at 10:44 AM, Ryan Blue  wrote:
> >
> >> I have no plan for 1.9.1.
> >>
> >> On Fri, Mar 30, 2018 at 10:42 AM, Henry Robinson 
> >> wrote:
> >>
> >>> Great! Do you know of any plans to do a 1.9.1?
> >>>
> >>> On 30 March 2018 at 09:35, Ryan Blue 
> wrote:
> >>>
>  I'm planning on getting a 1.10.0 rc out today, if I don't find
> problems
>  with the stats changes.
> 
>  On Thu, Mar 29, 2018 at 4:18 PM, Henry Robinson 
>  wrote:
> 
>  > Hi all -
>  >
>  > While using Spark, I got hit by PARQUET-1217 today on some data
>  written by
>  > Impala. This is a pretty nasty bug, and one that affects Apache
> Spark
>  right
>  > now because, AFAICT, there's no release to move to that contains the
>  fix,
>  > and parquet-mr 1.9.0 is affected. There is a workaround, but it's
>  expensive
>  > in terms of lost performance.
>  >
>  > I'm new to the community, so wanted to see if there was a plan to
>  make a
>  > release (1.9.1?) in the near future. I'd rather that than have to
>  build
>  > short-term workarounds into Spark.
>  >
>  > Best,
>  > Henry
>  >
> 
> 
> 
>  --
>  Ryan Blue
>  Software Engineer
>  Netflix
> 
> >>>
> >>>
> >>>
> >>> --
> >>> Henry Robinson
> >>> Software Engineer
> >>> Cloudera
> >>> 415-994-6679
> >>>
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >>
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>


Re: [VOTE] Release Apache Parquet Java 1.10.0 RC0

2018-04-06 Thread Julien Le Dem
+1 (binding)
* verified signature and checksum
* build and tested on osx

On Fri, Apr 6, 2018 at 9:55 AM, Uwe L. Korn  wrote:

> +1 (binding)
>
> * Verified signature and checksum
> * Build and tested on OSX using `mvn clean install
> -Dthrift.executable=/usr/local/opt/thrift@0.9/bin/thrift`
>
> On Fri, Apr 6, 2018, at 6:25 PM, Ryan Blue wrote:
> > Make that -D, not -P.
> >
> > On Fri, Apr 6, 2018 at 9:24 AM, Ryan Blue  wrote:
> >
> > > You can either put 0.9 earlier in your PATH, or set thrift.executable:
> > >
> > > mvn clean install -Pthrift.executable=/path/to/bin/thrift
> > >
> > > ​
> > >
> > > On Fri, Apr 6, 2018 at 9:11 AM, Uwe L. Korn  wrote:
> > >
> > >> The build is failing for me because it is picking up my installation
> of
> > >> Thrift 0.11. Is there a variable I could set to point it to my Thrift
> 0.9
> > >> installation?
> > >>
> > >> Uwe
> > >>
> > >> On Fri, Apr 6, 2018, at 2:34 PM, Zoltan Ivanfi wrote:
> > >> > I would have preferred waiting for the parquet-format release (which
> > >> > unfortunately failed the vote due to lack of interest) before
> making an
> > >> RC
> > >> > for parquet-mr so that it would refer to the latest format. On the
> other
> > >> > hand, all relevant updates of parquet-format are in the
> documentation
> > >> only,
> > >> > so it's not a reason for redoing the RC in itself.
> > >> >
> > >> > +1 (non-binding)
> > >> >
> > >> >- Reviewed commits that were added during the release process
> > >> without a
> > >> >proper review.
> > >> >- Validated checksums and signature.
> > >> >- Compared the source tarball to the git repo's state at the
> release
> > >> >label.
> > >> >
> > >> > Zoltan
> > >> >
> > >> > On Fri, Apr 6, 2018 at 8:23 AM Gabor Szadovszky <
> > >> > gabor.szadovs...@cloudera.com> wrote:
> > >> >
> > >> > > +1 (non-binding)
> > >> > >
> > >> > > Validated signature, checksums, matched source tarballs with the
> git
> > >> repo.
> > >> > >
> > >> > > Gabor
> > >> > >
> > >> > > > On 6 Apr 2018, at 03:05, Ryan Blue  wrote:
> > >> > > >
> > >> > > > And if anyone wants to try out the new command-line interface,
> you
> > >> can
> > >> > > call
> > >> > > > it like this:
> > >> > > >
> > >> > > > hadoop jar parquet-cli-1.10.0-runtime.jar
> > >> org.apache.parquet.cli.Main
> > >> > > >
> > >> > > > rb
> > >> > > > ​
> > >> > > >
> > >> > > > On Thu, Apr 5, 2018 at 5:58 PM, Ryan Blue 
> wrote:
> > >> > > >
> > >> > > >> +1 (binding)
> > >> > > >>
> > >> > > >> Built, tested, validated signature and checksums, and tested
> the
> > >> Iceberg
> > >> > > >> build with the artifacts.
> > >> > > >>
> > >> > > >> On Thu, Apr 5, 2018 at 2:15 PM, Ryan Blue 
> wrote:
> > >> > > >>
> > >> > > >>> Hi everyone,
> > >> > > >>>
> > >> > > >>> I propose the following RC to be released as official Apache
> > >> Parquet
> > >> > > Java
> > >> > > >>> 1.10.0 release.
> > >> > > >>>
> > >> > > >>> The commit id is 031a6654009e3b82020012a18434c582bd74c73a
> > >> > > >>>
> > >> > > >>>   - This corresponds to the tag: apache-parquet-1.10.0
> > >> > > >>>   - https://github.com/apache/parquet-mr/tree/031a665
> > >> > > >>>
> > >> > > >>> The release tarball, signature, and checksums are here:
> > >> > > >>>
> >> > > >>>   - https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.10.0-rc0/
> > >> > > >>>
> > >> > > >>> You can find the KEYS file here:
> > >> > > >>>
> > >> > > >>>   - https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > >> > > >>>
> > >> > > >>> Binary artifacts are staged in Nexus here:
> > >> > > >>>
> >> > > >>>   - https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet/1.10.0/
> > >> > > >>>
> > >> > > >>> This release includes:
> > >> > > >>>
> > >> > > >>>   - The new Parquet command-line tool
> > >> > > >>>   - New APIs to avoid leaking Hadoop classes
> > >> > > >>>   - Fixed sort order for logical types
> > >> > > >>>   - Fixed stats handling for NaN and other floating point edge
> > >> cases
> > >> > > >>>
> > >> > > >>> The full change log is available here:
> > >> > > >>>
> >> > > >>>   - https://github.com/apache/parquet-mr/blob/031a665/CHANGES.md
> > >> > > >>>
> > >> > > >>> Please download, verify, and test.
> > >> > > >>>
> > >> > > >>> Please vote by Tuesday, 10 April 2018.
> > >> > > >>>
> > >> > > >>> [ ] +1 Release this as Apache Parquet Java 1.10.0
> > >> > > >>> [ ] +0
> > >> > > >>> [ ] -1 Do not release this because…
> > >> > > >>> ​
> > >> > > >>> --
> > >> > > >>> Ryan Blue
> > >> > > >>>
> > >> > > >>
> > >> > > >>
> > >> > > >>
> > >> > > >> --
> > >> > > >> Ryan Blue
> > >> > > >>
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Ryan Blue
> > >> > >
> > >> > >
> > >> > >
> > >>
> > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
> >
> >
> > --
> > Ryan Blue
> > 

[jira] [Resolved] (PARQUET-1259) Parquet-protobuf support both protobuf 2 and protobuf 3

2018-04-04 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-1259.

Resolution: Workaround

supporting more than one version adds complexity.

It sounds like people can use protobuf 2 syntax with protobuf 3 library

I would recommend that instead.

I'll close this for now.

Please re-open if this is not satisfying.

> Parquet-protobuf support both protobuf 2 and protobuf 3
> ---
>
> Key: PARQUET-1259
> URL: https://issues.apache.org/jira/browse/PARQUET-1259
> Project: Parquet
>  Issue Type: New Feature
>Affects Versions: 1.10.0, 1.9.1
>Reporter: Qinghui Xu
>Priority: Major
>
> With the merge of pull request: 
> [https://github.com/apache/parquet-mr/pull/407,] now it is protobuf 3 used in 
> parquet-protobuf, and this implies that it cannot work in an environment 
> where people are using protobuf 2 in their own dependencies because there is 
> some new API / breaking change in protobuf 3. People have to face a 
> dependency version conflict with next parquet-protobuf release (e.g. 1.9.1 or 
> 1.10.0).
> What if we support both protobuf 2 and protobuf 3 by providing 
> parquet-protobuf and parquet-protobuf2?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: String interning in parquet-format

2018-04-03 Thread Julien Le Dem
The main reason for the string interning is saving memory. Some of the
early Parquet design uses the column names in the metadata to refer to
columns. When deserializing metadata we get a new string instance for each
occurrence even though it is the same string. We don't need to rely on the
interning mechanism; it was just convenient to dedup the strings.
If there's a better mechanism to do that then it is fine to replace
interning with it.
Cheers
Julien

On Tue, Apr 3, 2018 at 12:15 PM, Robert Kruszewski 
wrote:

> I have been pointed to https://github.com/apache/parquet-format/pull/2
> which is the original PR for PARQUET-11. Looking at
> http://hg.openjdk.java.net/jdk10/master/file/be620a591379/src/hotspot/
> share/gc/cms/concurrentMarkSweepGeneration.cpp#l2563
> and
> http://hg.openjdk.java.net/jdk10/master/file/be620a591379/src/hotspot/
> share/gc/cms/concurrentMarkSweepGeneration.cpp#l5261
> it does look like interned strings are very rarely gc'ed.
>
> On Tue, 3 Apr 2018 at 18:45 Robert Kruszewski  wrote:
>
> > Hi parquet-dev,
> >
> > I wanted to start a discussion around the existence of string interning
> in
> > the thrift protocol in parquet-format. I posted some links
> > https://issues.apache.org/jira/browse/PARQUET-1261 and while I haven't
> > done perf benchmarking I have previously seen interened strings to cause
> GC
> > overhead limit exceeded exceptions. Only reference I could find why this
> > has been added is reference to
> > https://issues.apache.org/jira/browse/PARQUET-11 which unfortunately
> > leads to deleted repo. Wonder if anyone remembers the exact details?
> >
> > If we deem string deduplication there to be necessary we should
> > investigate implementing a simple cache instead. I'd hope we can simply get
> > rid of interning without much harm but would love to hear others
> opinions.
> >
> > Robert
> >
>
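The explicit cache Robert proposes as a replacement for interning can be sketched in a few lines. parquet-mr is Java, so this Python sketch (the class name is mine) only shows the idea: collapse equal strings to one canonical instance in an application-owned map, without touching the runtime's intern table:

```python
# Sketch of replacing String.intern() with an explicit dedup cache:
# repeated column-name strings deserialized from metadata collapse to
# a single canonical instance, and the cache can be discarded after
# deserialization instead of living in the VM-wide intern table.

class StringDeduper:
    def __init__(self):
        self._cache = {}

    def dedup(self, s: str) -> str:
        # Return the canonical instance for this string value.
        return self._cache.setdefault(s, s)

deduper = StringDeduper()
# Build equal strings at runtime so they are distinct objects.
a = "".join(["column_", "name"])
b = "".join(["column_na", "me"])
da, db = deduper.dedup(a), deduper.dedup(b)
print(da == a, da is db)  # True True: equal values share one instance
```

Unlike intern(), the cache's lifetime is controlled by the caller, so deduplicated strings are collectible as soon as the metadata that references them is.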


parquet sync happening now

2018-03-28 Thread Julien Le Dem
https://meet.google.com/xpc-gwie-sem


[jira] [Updated] (PARQUET-1222) Definition of float and double sort order is ambiguous

2018-03-13 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-1222:
---
Summary: Definition of float and double sort order is ambiguous  (was: 
Definition of float and double sort order is ambigious)

> Definition of float and double sort order is ambiguous
> --
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
> Fix For: format-2.5.0
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.
> For reading existing stats created using TypeDefinedOrder, the following 
> compatibility rules should be applied:
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is \+0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain \+0 values as well.
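As a sketch of the ambiguity described above (illustrative only, not part of the ticket): Java's comparison operators define only a partial order on doubles, while `Double.compare` implements a total order that resolves the NaN and signed-zero corner cases deterministically.

```java
// Illustrative sketch (not from the ticket): why signed comparison of
// doubles is only a partial ordering, and how a total ordering such as
// Java's Double.compare resolves the NaN and signed-zero corner cases.
public class FloatOrderingDemo {
    public static void main(String[] args) {
        // Partial order: NaN compares false against everything, so a NaN
        // min/max silently poisons statistics-based filtering.
        System.out.println(Double.NaN < 1.0);   // false
        System.out.println(Double.NaN > 1.0);   // false
        // -0.0 == +0.0, so a min of +0 may hide -0 values (and vice versa).
        System.out.println(-0.0 == 0.0);        // true

        // Total order: NaN sorts above everything, -0.0 below +0.0.
        System.out.println(Double.compare(Double.NaN, 1.0) > 0);  // true
        System.out.println(Double.compare(-0.0, 0.0) < 0);        // true
    }
}
```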



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Parquet sync starting now

2018-03-13 Thread Julien Le Dem
https://meet.google.com/jpy-mump-ngc


[jira] [Resolved] (PARQUET-1135) upgrade thrift and protobuf dependencies

2018-03-09 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-1135.

Resolution: Fixed

merged in:

https://github.com/apache/parquet-mr/commit/3d2d4fd1588c8eb3f67f34d75b66967d0c7b06b6

> upgrade thrift and protobuf dependencies
> 
>
> Key: PARQUET-1135
> URL: https://issues.apache.org/jira/browse/PARQUET-1135
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>        Reporter: Julien Le Dem
>        Assignee: Julien Le Dem
>Priority: Major
> Fix For: 1.9.1
>
>
> thrift 0.7.0 -> 0.9.3
>  protobuf 3.2 -> 3.5.1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1135) upgrade thrift and protobuf dependencies

2018-03-09 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-1135:
---
Fix Version/s: 1.9.1
  Description: 
thrift 0.7.0 -> 0.9.3
 protobuf 3.2 -> 3.5.1

  was:
thrift 0.7.0 -> 0.9.3
protobuf 3.2 -> 3.4


> upgrade thrift and protobuf dependencies
> 
>
> Key: PARQUET-1135
> URL: https://issues.apache.org/jira/browse/PARQUET-1135
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Major
> Fix For: 1.9.1
>
>
> thrift 0.7.0 -> 0.9.3
>  protobuf 3.2 -> 3.5.1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Date for next Parquet sync

2018-03-08 Thread Julien Le Dem
Actually because of Daylight saving time we will have one less hour next
week.
https://www.timeanddate.com/worldclock/meetingdetails.html?year=2018&month=3&day=13&hour=17&min=0&sec=0&p1=224&p2=50&p3=195
San Francisco (USA - California): Tuesday, March 13, 2018 at 10:00:00 am PDT (UTC-7 hours)
Budapest (Hungary): Tuesday, March 13, 2018 at 6:00:00 pm CET (UTC+1 hour)
Paris (France - Île-de-France): Tuesday, March 13, 2018 at 6:00:00 pm CET (UTC+1 hour)
Corresponding UTC (GMT): Tuesday, March 13, 2018 at 17:00:00


On Thu, Mar 8, 2018 at 4:12 PM, Julien Le Dem <julien.le...@gmail.com>
wrote:

> or 10am PST but it's a little late for the team in Budapest.
>
> On Thu, Mar 8, 2018 at 4:11 PM, Julien Le Dem <julien.le...@gmail.com>
> wrote:
>
>> I'm sorry, it turns out I now have a conflict at this particular time.
>> Maybe Wednesday?
>>
>> On Mon, Mar 5, 2018 at 10:55 AM, Lars Volker <l...@cloudera.com> wrote:
>>
>>> Hi All,
>>>
>>> It has been almost 3 weeks since the last sync and there are a bunch of
>>> ongoing discussions on the mailing list. Let's find a date for the next
>>> Parquet community sync. Last time we met on a Wednesday, so this time it
>>> should be Tuesday.
>>>
>>> I propose to meet next Tuesday, March 13th, at 6pm CET / 9am PST. That
>>> allows us to get back to the biweekly cadence without overlapping with
>>> the
>>> Arrow sync, which happens this week.
>>>
>>> Please speak up if that time does not work for you.
>>>
>>> Cheers, Lars
>>>
>>
>>
>


Re: Date for next Parquet sync

2018-03-08 Thread Julien Le Dem
or 10am PST but it's a little late for the team in Budapest.

On Thu, Mar 8, 2018 at 4:11 PM, Julien Le Dem <julien.le...@gmail.com>
wrote:

> I'm sorry, it turns out I now have a conflict at this particular time.
> Maybe Wednesday?
>
> On Mon, Mar 5, 2018 at 10:55 AM, Lars Volker <l...@cloudera.com> wrote:
>
>> Hi All,
>>
>> It has been almost 3 weeks since the last sync and there are a bunch of
>> ongoing discussions on the mailing list. Let's find a date for the next
>> Parquet community sync. Last time we met on a Wednesday, so this time it
>> should be Tuesday.
>>
>> I propose to meet next Tuesday, March 13th, at 6pm CET / 9am PST. That
>> allows us to get back to the biweekly cadence without overlapping with the
>> Arrow sync, which happens this week.
>>
>> Please speak up if that time does not work for you.
>>
>> Cheers, Lars
>>
>
>


Re: Date for next Parquet sync

2018-03-08 Thread Julien Le Dem
I'm sorry, it turns out I now have a conflict at this particular time.
Maybe Wednesday?

On Mon, Mar 5, 2018 at 10:55 AM, Lars Volker  wrote:

> Hi All,
>
> It has been almost 3 weeks since the last sync and there are a bunch of
> ongoing discussions on the mailing list. Let's find a date for the next
> Parquet community sync. Last time we met on a Wednesday, so this time it
> should be Tuesday.
>
> I propose to meet next Tuesday, March 13th, at 6pm CET / 9am PST. That
> allows us to get back to the biweekly cadence without overlapping with the
> Arrow sync, which happens this week.
>
> Please speak up if that time does not work for you.
>
> Cheers, Lars
>


Re: parquet sync

2018-02-14 Thread Julien Le Dem
Notes:
Attendees, Agenda:
Lars (Cloudera Impala): Zoltan proposal to get to a more stable release or
feature flag
Qinghui, Benoit, Miguel, Justin (Criteo): Pull request. Parquet-proto.
PARQUET-968
Gidon (IBM): encryption JIRA. On track
Ryan (Netflix): getting 1.10 out
Zoltan (Cloudera): column index fixes from Gabor, ideas on list
Anna (Cloudera): Compatibility issues.

Discussion:
Compatibility issues and flags:

   - Define standard flags for features that are supported or not:
      - New compression algorithms: Brotli, ZStandard, ...
      - New encodings (since v1): delta-int, ...
   - Flags are standards across parquet implementations to limit usage of
     features to a set supported across all components
   - Define (a few) profiles with the sets of features supported for a
     given version (1.0, 2.0, 3.0)
      - These are goals for any implementation to support.
   - To be discussed: optional features that can be ignored and don’t
     prevent reading the file (ex: bloom filters, page index)
   - Zoltan: create JIRA and Google doc with a design proposal

Parquet-proto:

   - Criteo to validate and give +1:
     https://github.com/apache/parquet-mr/pull/411
   - New feature needed:
      - Support: empty list vs. null list.
      - Criteo will create a JIRA and submit a new PR

Column indexes (by Gabor), PR: https://github.com/apache/parquet-mr/pull/456

   - Needs modification in parquet-format utils (not the thrift metadata)
     => new release
   - First version writing into parquet-mr
   - Actions:
      - Ryan to review
      - Ryan and Zoltan to follow up on making a parquet-format release






On Wed, Feb 14, 2018 at 9:02 AM, Julien Le Dem <julien.le...@wework.com>
wrote:

> starting now on google hangout:
> https://meet.google.com/nhj-cvpt-atx
>


parquet sync

2018-02-14 Thread Julien Le Dem
starting now on google hangout:
https://meet.google.com/nhj-cvpt-atx


Re: Date and Time for next Parquet sync

2018-02-09 Thread Julien Le Dem
If you have received an invitation for next Wednesday, please disregard it
for now.
I was just adding people to the list of reminders.
I'll move it to whenever is the conclusion of this thread.
I have a conflict on Tuesday though.
I am available on Wednesday.

On Wed, Feb 7, 2018 at 11:29 PM, Gabor Szadovszky <
gabor.szadovs...@cloudera.com> wrote:

> Hi All,
>
> I would vote on Tuesday but don’t have any problem with skipping this one
> if Wednesday fits more for others.
>
> Cheers,
> Gabor
>
> > On 7 Feb 2018, at 19:00, Lars Volker  wrote:
> >
> > Hi All,
> >
> > I propose to have the next regular Parquet sync next week, either on
> > Tuesday or Wednesday at 9am PST / 6pm CET.
> >
> > The last one was on a Tuesday so this one would default to Wednesday.
> Let's
> > have a quick vote here by replying to this email with your day of choice.
> > Feel free to propose any other time if neither of these work for you.
> >
> > Cheers, Lars
>
>


Re: parquet sync

2018-01-30 Thread Julien Le Dem
notes:
Julien (Wework)
Gidon (IBM): secure analytics. JIRA + Draft
Ryan (Netflix): Parquet-787 needs review
Lars (Cloudera, Impala): Discuss Zoltan’s proposal. Feature sets
Jim (Cloudera, Impala): Bloom filters
Zoltan (Cloudera): Java 8 transition, breaking changes management
Gabor (Cloudera): column index implement in parquet-mr
Nandor (Cloudera)
Uwe (Blue Yonder)
Marcel

Agenda:

   - Bloom filters: https://github.com/apache/parquet-cpp/pull/432
      - Patch out for review for bloom filter in C++
      - Perf comparison for bloom filter vs. dictionary?
      - Need guidance on bloom filter size and a mechanism to avoid
        writing too big a bloom filter.
      - Ryan to follow up
   - Proposal for secure analytics: PARQUET-1178
      - Allow encryption while maintaining Parquet push-down capabilities
      - Step 1: encryption with a single key, allowing individual columns
        to be encrypted or not.
   - Java 8 transition:
      - Will move Parquet to Java 8
   - Breaking changes management, feature set proposal from Zoltan
      - Parquet
   - PARQUET-787 needs review
      - Works in production at Netflix
      - Please review and approve if appropriate
   - Next sync, Tuesday in 2 weeks.
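For the bloom filter size guidance discussed above, the classic sizing formulas give a quick sanity check. This is a sketch with hypothetical helper names (`optimalBits`, `optimalHashes`), not parquet-cpp's actual API:

```java
// Hypothetical sizing helpers (names are illustrative, not Parquet's API):
// standard bloom filter formulas m = -n*ln(p)/(ln 2)^2 bits and
// k = (m/n)*ln 2 hash functions, for n values at false-positive rate p.
public class BloomSizing {
    static long optimalBits(long n, double fpp) {
        return (long) Math.ceil(-n * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000;   // distinct values in a column chunk
        double fpp = 0.01;    // 1% false-positive target
        long m = optimalBits(n, fpp);
        // Roughly 9.6M bits (over a megabyte) for a million values at 1%,
        // which is why writers need a cap on filter size.
        System.out.println(m + " bits, k=" + optimalHashes(n, m));
    }
}
```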



On Tue, Jan 30, 2018 at 6:59 PM, Julien Le Dem <julien.le...@gmail.com>
wrote:

> happening now: meet.google.com/nhj-cvpt-atx
>


parquet sync

2018-01-30 Thread Julien Le Dem
happening now: meet.google.com/nhj-cvpt-atx


Re: Next parquet sync

2018-01-10 Thread Julien Le Dem
notes:
Agenda and attendees:

   - Anuj Phadke (Impala team)
   - Uwe (Blue Yonder, parquet-cpp):
      - Discuss parquet dotnet project
   - Lars (Impala):
      - timestamp int96, deprecate ordering
   - Nandor (file format team at Cloudera)
   - Zoltan (Cloudera):
      - discuss page size recommendation
   - Gabor (file formats)
   - Ryan (Netflix):
      - Been working on a better read API. Rewrote record construction in
        parquet-avro: +5%.
      - Discuss PARQUET-787 (how we build the decoders for byte arrays)
   - Marcel
   - Julien (WeWork):
      - releases


Agenda:

   - Deprecating ordering for int96 timestamp:
     https://issues.apache.org/jira/browse/PARQUET-1065
      - It was decided to deprecate ordering for int96:
         - Do not use existing min/max stats for int96
         - Label int96 as not supporting ordering
         - Do not write int96 anymore
         - Always support reading int96 for backward compatibility with
           existing files
      - PR from Zoltan: change int96 ordering from unsigned to undefined.
      - Lars: Impala actually uses int96 min/max and ordering and will do
        so for some time.
      - Conclusion:
         - Add language to say writing int96 is not allowed (caveat that
           people can do things anyway). People should use 64-bit
           timestamps instead.
         - Add the spec of int96 to the doc with a warning that its only
           purpose is to enable reading existing files.
         - PRs:
            - https://github.com/apache/parquet-format/pull/77
            - https://github.com/apache/parquet-format/pull/49
            - Action: Lars to update #49



   - Parquet dotnet project
      - Discussion on whether we should import it into the Apache Parquet
        project
      - General advice is to make sure the authors are engaged enough with
        the project to maintain it long term.
      - We should keep reaching out and support this effort

   - Page size recommendation
      - Zoltan: create a JIRA.
      - Wait for the page skipping implementation to get numbers on the
        impact of page size
      - Look at different strategies for page size (bytes before
        compression, #values, ...)
      - Make some measurements
      - Restart the conversation

   - PARQUET-787: needs a review
     https://github.com/apache/parquet-mr/pull/390

   - Releases
      - Ryan: create release JIRA



On Tue, Jan 9, 2018 at 8:54 AM, Julien Le Dem <julien.le...@wework.com>
wrote:

> The sync is starting in a few minutes:
> https://meet.google.com/cxa-nppv-caa
> (as a reminder, everybody is welcome to join if only to be a fly on the
> wall)
>
> On Tue, Jan 9, 2018 at 2:31 AM, Lars Volker <l...@cloudera.com> wrote:
>
>> Great, I sent out an invite. If anyone wants to join but was not on the
>> invite, please let me know.
>>
>> Cheers, Lars
>>
>> On Mon, Jan 8, 2018 at 10:24 PM, Julien Le Dem <julien.le...@wework.com>
>> wrote:
>>
>> > It sounds like we're doing the parquet sync tomorrow Tuesday January
>> 9th at
>> > 9am PT (5pm UTC)
>> >
>> > On Thu, Jan 4, 2018 at 9:17 AM, Marcel Kornacker <marc...@gmail.com>
>> > wrote:
>> >
>> > > My preference for next week would be Tuesday as well.
>> > >
>> > > On Thu, Jan 4, 2018 at 8:25 AM, Zoltan Ivanfi <z...@cloudera.com>
>> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > According to the latest results of the availability poll, Tuesdays
>> > seems
>> > > > to work for slightly more people than Wednesdays. I'll try to post
>> the
>> > > > chart below, let's see whether the mailing list allows it or removes
>> > it:
>> > > > [image: pasted1]
>> > > >
>> > > > I would suggest to either use Tuesdays or alternate between Tuesdays
>> > and
>> > > > Wednesdays (since the group of 9 Tuesday voters does not contain
>> all 8
>> > > > Wednesday voters). The last sync was on Tuesdays, so the next can
>> be on
>> > > > Wednesday if you would like to follow this alternating scheme.
>> > > >
>> > > > Best regards,
>> > > >
>> > > > Zoltan
>> > > >
>> > > >
>> > > > On Thu, Jan 4, 2018 at 4:27 PM Wes McKinney <wesmck...@gmail.com>
>> > wrote:
>> > > >
>> > > >> We have been staggering the Arrow syncs by 1 week, also on
>> Wednesdays
>> > > >> at 9am PT. If you are going to have the next Parquet sync on 1/10,
>> we
>> > > >> would have the next Arrow sync on 1/17. Let me know what you prefer
>> > > >>
>> > > >> On Thu, Jan 4, 2018 at 4:10 AM, Lars Volker <l...@cloudera.com>
>> wrote:
>> > > >> > 1/10 would work for me.
>> > > >> >
>> > > >> > On Thu, Jan 4, 2018 at 3:22 AM, Julien Le Dem <
>> > julien.le...@gmail.com
>> > > >
>> > > >> > wrote:
>> > > >> >
>> > > >> >> Any day of the week/time preference for the next Parquet sync?
>> > > >> >> It is usually held at 9am PT (5pm UTC) on a Wednesday.
>> > > >> >>
>> > > >>
>> > > >
>> > >
>> >
>>
>
>


Re: Next parquet sync

2018-01-09 Thread Julien Le Dem
The sync is starting in a few minutes:
https://meet.google.com/cxa-nppv-caa
(as a reminder, everybody is welcome to join if only to be a fly on the
wall)

On Tue, Jan 9, 2018 at 2:31 AM, Lars Volker <l...@cloudera.com> wrote:

> Great, I sent out an invite. If anyone wants to join but was not on the
> invite, please let me know.
>
> Cheers, Lars
>
> On Mon, Jan 8, 2018 at 10:24 PM, Julien Le Dem <julien.le...@wework.com>
> wrote:
>
> > It sounds like we're doing the parquet sync tomorrow Tuesday January 9th
> at
> > 9am PT (5pm UTC)
> >
> > On Thu, Jan 4, 2018 at 9:17 AM, Marcel Kornacker <marc...@gmail.com>
> > wrote:
> >
> > > My preference for next week would be Tuesday as well.
> > >
> > > On Thu, Jan 4, 2018 at 8:25 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > According to the latest results of the availability poll, Tuesdays
> > seems
> > > > to work for slightly more people than Wednesdays. I'll try to post
> the
> > > > chart below, let's see whether the mailing list allows it or removes
> > it:
> > > > [image: pasted1]
> > > >
> > > > I would suggest to either use Tuesdays or alternate between Tuesdays
> > and
> > > > Wednesdays (since the group of 9 Tuesday voters does not contain all
> 8
> > > > Wednesday voters). The last sync was on Tuesdays, so the next can be
> on
> > > > Wednesday if you would like to follow this alternating scheme.
> > > >
> > > > Best regards,
> > > >
> > > > Zoltan
> > > >
> > > >
> > > > On Thu, Jan 4, 2018 at 4:27 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > > >
> > > >> We have been staggering the Arrow syncs by 1 week, also on
> Wednesdays
> > > >> at 9am PT. If you are going to have the next Parquet sync on 1/10,
> we
> > > >> would have the next Arrow sync on 1/17. Let me know what you prefer
> > > >>
> > > >> On Thu, Jan 4, 2018 at 4:10 AM, Lars Volker <l...@cloudera.com>
> wrote:
> > > >> > 1/10 would work for me.
> > > >> >
> > > >> > On Thu, Jan 4, 2018 at 3:22 AM, Julien Le Dem <
> > julien.le...@gmail.com
> > > >
> > > >> > wrote:
> > > >> >
> > > >> >> Any day of the week/time preference for the next Parquet sync?
> > > >> >> It is usually held at 9am PT (5pm UTC) on a Wednesday.
> > > >> >>
> > > >>
> > > >
> > >
> >
>


Re: Iceberg table format

2018-01-03 Thread Julien Le Dem
Happy new year!
I'm interested as well.
Did you get to publish your code on github?
Thanks

On Fri, Dec 8, 2017 at 8:42 AM, Ryan Blue  wrote:

> I'm working on getting the code out to our open source github org, probably
> early next week. I'll set up a mailing list for it as well.
>
> rb
>
> On Thu, Dec 7, 2017 at 6:38 PM, Jacques Nadeau  wrote:
>
> > Sounds super interesting. Would love to collaborate on this. Do you have
> a
> > repo or mailing list where you are working on this?
> >
> >
> >
> > On Wed, Dec 6, 2017 at 4:20 PM, Ryan Blue 
> > wrote:
> >
> >> Hi everyone,
> >>
> >> I mentioned in the sync-up this morning that I’d send out an
> introduction
> >> to the table format we’re working on, which we’re calling Iceberg.
> >>
> >> For anyone that wasn’t around here’s the background: there are several
> >> problems with how we currently manage data files to make up a table in
> the
> >> Hadoop ecosystem. The one that came up today was that you can’t actually
> >> update a table atomically to, for example, rewrite a file and safely
> >> delete
> >> records. That’s because Hive tables track what files are currently
> visible
> >> by listing partition directories, and we don’t have (or want)
> transactions
> >> for changes in Hadoop file systems. This means that you can’t actually
> >> have
> >> isolated commits to a table and the result is that *query results from
> >> Hive
> >> tables can be wrong*, though rarely in practice.
> >>
> >> The problems with current tables are caused primarily by keeping state
> >> about what files are in or not in a table in the file system. As I said,
> >> one problem is that there are no transactions but you also have to list
> >> directories to plan jobs (bad on S3) and rename files from a temporary
> >> location to a final location (really, really bad on S3).
> >>
> >> To avoid these problems we’ve been building the Iceberg format that
> tracks
> >> tracks every file in a table instead of tracking directories. Iceberg
> >> maintains snapshots of all the files in a dataset and atomically swaps
> >> snapshots and other metadata to commit. There are a few benefits to
> doing
> >> it this way:
> >>
> >>- *Snapshot isolation*: Readers always use a consistent snapshot of
> the
> >>table, without needing to hold a lock. All updates are atomic.
> >>- *O(1) RPCs to plan*: Instead of listing O(n) directories in a table
> >> to
> >>plan a job, reading a snapshot requires O(1) RPC calls
> >>- *Distributed planning*: File pruning and predicate push-down is
> >>distributed to jobs, removing the metastore bottleneck
> >>- *Version history and rollback*: Table snapshots are kept around and
> >> it
> >>is possible to roll back if a job has a bug and commits
> >>- *Finer granularity partitioning*: Distributed planning and O(1) RPC
> >>calls remove the current barriers to finer-grained partitioning
> >>
> >> We’re also taking this opportunity to fix a few other problems:
> >>
> >>- Schema evolution: columns are tracked by ID to support
> >> add/drop/rename
> >>- Types: a core set of types, thoroughly tested to work consistently
> >>across all of the supported data formats
> >>- Metrics: cost-based optimization metrics are kept in the snapshots
> >>- Portable spec: tables should not be tied to Java and should have a
> >>simple and clear specification for other implementers
> >>
> >> We have the core library to track files done, along with most of a
> >> specification, and a Spark datasource (v2) that can read Iceberg tables.
> >> I’ll be working on the write path next and we plan to build a Presto
> >> implementation soon.
> >>
> >> I think this should be useful to others and it would be great to
> >> collaborate with anyone that is interested.
> >>
> >> rb
> >> ​
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >>
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Next parquet sync

2018-01-03 Thread Julien Le Dem
Any day of the week/time preference for the next Parquet sync?
It is usually held at 9am PT (5pm UTC) on a Wednesday.


parquet sync starting now

2017-12-06 Thread Julien Le Dem
https://meet.google.com/ttv-rton-ber
(all welcome)


Re: parquet sync starting in a few minutes

2017-11-22 Thread Julien Le Dem
 Notes from the meeting

Attendees:
Julien (WeWork): release
Hakan (Criteo): moving to parquet.
Marcel (unaffiliated)
Lars (Impala, Cloudera): new statistics min_value/max_value fields in
parquet_v2.
Gabor (Cloudera): min/max stats impl., parquet-mr.
Zoltan (Cloudera): Min/max
Anna (Cloudera): Min/Max
Uwe (BlueYonder)
Vuk Ercegovac (Cloudera)
Ryan (Netflix): getting reviews /429, parquet 2.0 reviews
Eric Owhadi (Trafodion): page level filtering. Min/max

Min_value/max_value implementation:
 https://issues.apache.org/jira/browse/PARQUET-1025

   - We should deprecate compareTo in Binary since it is at the physical
     type level while ordering is a logical type notion
      - We discussed a possible better implementation of compareTo that
        would take the LogicalType into account but agreed this would be a
        separate effort
   - Add a Comparator based on the logical type that is the preferred way
     of comparing 2 values
   - Stats writer implementation:
      - The preferred implementation is for writers to implement the new
        min_value/max_value metadata fields instead of the old min/max,
        independently of the version.
      - Optionally writers might decide to also populate min/max for
        compatibility with older tools, but we should do this only if the
        need arises.
   - Action: provide feedback on the JIRA above (PARQUET-1025)
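To illustrate why the comparator must come from the logical type rather than the physical one (a sketch, not parquet-mr's actual API): the same 32-bit physical value orders differently under signed INT_32 vs. unsigned UINT_32 semantics.

```java
import java.util.Comparator;

// Illustrative sketch (not parquet-mr's actual API): the same INT32
// physical value must be ordered according to its logical type. The bit
// pattern of -1 is the largest possible UINT_32 value (4294967295).
public class LogicalOrderDemo {
    static final Comparator<Integer> SIGNED_INT32 = Integer::compare;
    static final Comparator<Integer> UNSIGNED_INT32 = Integer::compareUnsigned;

    public static void main(String[] args) {
        int a = -1, b = 1;
        System.out.println(SIGNED_INT32.compare(a, b) < 0);    // true: -1 < 1
        System.out.println(UNSIGNED_INT32.compare(a, b) > 0);  // true: 4294967295 > 1
    }
}
```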

Ryan has two PRs for review:

   - Make sure the Hadoop api does not leak through the Parquet api.
   https://github.com/apache/parquet-mr/pull/429
   - Improved Read allocation API:
   https://github.com/apache/parquet-mr/pull/390

Action: give feedback on pull requests.

next meeting in 2 weeks. same time.




On Wed, Nov 22, 2017 at 8:57 AM, Julien Le Dem <julien.le...@gmail.com>
wrote:

> https://meet.google.com/udi-dvmo-sva
>


parquet sync starting in a few minutes

2017-11-22 Thread Julien Le Dem
https://meet.google.com/udi-dvmo-sva


parquet sync starting now

2017-11-08 Thread Julien Le Dem
https://meet.google.com/oto-xpdf-kug


[Announce] Congrats to our new Parquet committers

2017-10-27 Thread Julien Le Dem
Zoltan Ivanfi and Lars Volker are now Parquet committers.
Deepak Majeti became a committer in July.
Thank you all for your sustained contribution to the project.
Welcome and congrats!


Parquet sync

2017-10-25 Thread Julien Le Dem
Starting now:
https://meet.google.com/oto-xpdf-kug


Re: parquet sync starting now

2017-10-11 Thread Julien Le Dem
Attendees/agenda:

Santlal

Deepak (Vertical): deprecation of older compression.

Lars (Cloudera, Impala): Column indexes

Marcel: Column indexes

Ryan (Netflix): release parquet-format 2.4.0. need help on java side.
parquet related table format (id based column projection)

Jim (Cloudera)

Zoltan (Cloudera)

Anna (Cloudera)

Julien:


New compression algorithms / deprecation of older compression:

 - We can't remove algorithms that have been used (LZO, Brotli). We can add
a recommendation on which algorithms to use.

 - Added language to clarify support of algorithms, plus the dependency on
installing some of them.

 - LZ4 is widely available.

 - Zstandard is harder to install but better.

Column indexes:

 - action: make max always present

 - always have min and max values (max not optional)

 - add metadata to capture if min/max are ordered. enum.

 - clarify meaning of null page.

 - todo: update PR and merge soon.

parquet-format release: blocked on page index

parquet related table format discussion: will happen separately.


next meeting in 2 weeks.

On Wed, Oct 11, 2017 at 9:06 AM, Julien Le Dem <julien.le...@gmail.com>
wrote:

> https://meet.google.com/oto-xpdf-kug
>


  1   2   3   4   >