Re: [PR] PARQUET-2409: Add custom .asf.yaml for email notification [parquet-format]

2023-12-05 Thread via GitHub


wgtmac merged PR #224:
URL: https://github.com/apache/parquet-format/pull/224


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: Files with inconsistent num_rows and num_values?

2023-12-05 Thread Micah Kornfield
Thanks for checking.



Re: Files with inconsistent num_rows and num_values?

2023-12-05 Thread Gang Wu
I scanned through the parquet-mr implementation. It provides a row-wise
interface to write records in the ColumnWriter. This cannot reproduce
the issue in this thread. I suspect some other implementations may have
their own column-wise column writer implementations and only write pages
to the parquet-mr layer.

Best,
Gang

On Wed, Nov 29, 2023 at 2:14 PM Micah Kornfield 
wrote:

> Hi Gang,
> For writes I'm seeing "parquet-mr version 1.11.1" and "parquet-mr version
> 1.10.1".  I need to look more into the page headers to check for
> consistency.  At the column level, in some cases the number of values read
> by pyarrow is consistent with num_rows and in some cases it is consistent
> with num_values. I don't see any discernable pattern based on schema or
> types.
>
> It looks like the parquet files might have been written with
> avro ("parquet.avro.schema" key and a corresponding schema are present in
> their metadata).
>
> Thanks,
> Micah
>
> On Tue, Nov 28, 2023 at 6:30 PM Gang Wu  wrote:
>
> > Hi Micah,
> >
> > Does the FileMetaData.version [1] provide any information about
> > the writer? What about the num_values in each page header? Is
> > the actual number of values consistent with num_values in the
> > ColumnMetaData?
> >
> > [1]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108
> >
> > Best,
> > Gang
> >
> > On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield 
> > wrote:
> >
> > > We've recently encountered files that have inconsistencies between the
> > > number of rows specified in the row group [1] and the total number of
> > > values in a column [2] for non-repeated columns (within a file there is
> > > inconsistency between columns but all counts appear to be greater than
> > > or equal to the number of rows).
> > >
> > > Two questions:
> > > 1.  Is anyone aware of parquet implementations that might generate
> > > files like this?
> > > 2.  Does anyone have an opinion on the correct interpretation of these
> > > files?  Should the files be treated as corrupt, or should the number of
> > > rows be treated as authoritative and any additional data in a column be
> > > truncated?
> > >
> > > It appears different engines make different choices in this case.
> > > Arrow treats this as corruption. Spark seems to allow reading the data.
> > >
> > > Thanks,
> > > Micah
> > >
> > >
> > > [1]
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
> > > [2]
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786
> > >
> >
>
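
For anyone who wants to check their own files for the mismatch discussed in this thread, here is a minimal sketch. It assumes pyarrow is installed and that the footer can still be read even when the column data is inconsistent. The function names (find_count_mismatches, report_file) are made up for illustration. Note that for repeated columns num_values may legitimately differ from num_rows, so a hit is only suspicious for non-repeated columns.

```python
def find_count_mismatches(metadata):
    """Return (row_group_index, column_path, num_rows, num_values) tuples
    for every column chunk whose num_values differs from its row group's
    num_rows.

    `metadata` is expected to look like pyarrow.parquet.FileMetaData:
    it needs num_row_groups and row_group(i), and each row group needs
    num_rows, num_columns and column(j).
    """
    mismatches = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            column = row_group.column(col_index)
            if column.num_values != row_group.num_rows:
                mismatches.append(
                    (rg_index, column.path_in_schema,
                     row_group.num_rows, column.num_values)
                )
    return mismatches


def report_file(path):
    """Print the writer string and any mismatching column chunks."""
    # Imported lazily so the helper above can be used without pyarrow.
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(path).metadata
    print("created_by:", metadata.created_by)
    for rg, name, rows, values in find_count_mismatches(metadata):
        print(f"row group {rg}, column {name}: "
              f"num_rows={rows}, num_values={values}")
```

Calling report_file on a suspect file lists the affected column chunks together with the created_by string, which is the field that showed "parquet-mr version 1.11.1" above. This only inspects footer metadata; it does not check the num_values recorded in individual page headers.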


Re: Fast nullify of columns?

2023-12-05 Thread Gang Wu
Hi Paul,

I agree there are better ways to do this, e.g. we can prepare encoded
definition levels and repetition levels (if they exist) and directly write
the page. However, we need to take care of other rewrite configurations
including data page version (v1 or v2), compression, page statistics and
page index. By writing null records, the writer handles all the above
details internally.

BTW, IMO writing `empty` pages may violate the spec and cause readers to fail.

Best,
Gang

On Mon, Dec 4, 2023 at 5:30 PM Paul Rooney  wrote:

> Could anyone suggest a faster way to Nullify columns in a parquet file?
>
> My dataset consists of a lot of parquet files.
> Each of them has roughly 12 million rows and 350 columns, split into
> 2 row groups of 10 million and 2 million rows.
>
> For each file I need to nullify 150 columns and rewrite the files.
>
> I tried using 'nullifyColumn' in
> 'parquet-mr/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java'
> but I find it slow because, for each column, it iterates over the number of
> rows and calls ColumnWriter.writeNull.
>
> Would anyone have suggestions on how to avoid all the iteration?
>
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L743C13
> ' for (int i = 0; i < totalChunkValues; i++) {...'
>
> Could a single call be made per column + row-group to write enough
> information to:
> A) keep the column present (in schema and as a Column chunk)
> B) set Column rowCount and num_nulls= totalChunkValues
>
>
> e.g. perhaps write a single 'empty' page which has:
> 1) valueCount and rowCount = totalChunkValues
> 2) Statistics.num_nulls set to totalChunkValues
>
> Thanks, Paul
>
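
To illustrate the "prepare encoded definition levels" idea mentioned above, here is a minimal sketch of how small the definition levels of an all-null column can be under the RLE/bit-packing hybrid encoding. It assumes a flat optional column (maximum definition level 1, no repetition levels), so every null has definition level 0. The function names are hypothetical and this is not how ParquetRewriter works. It covers the level bytes only; the page header, compression, statistics and page index mentioned above would still need to be handled.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_rle_run(level: int, count: int, bit_width: int) -> bytes:
    """Encode `count` repetitions of `level` as one RLE run.

    The run header is varint(count << 1); the low bit 0 marks an RLE run
    (as opposed to a bit-packed run). The repeated value follows, stored
    in ceil(bit_width / 8) bytes.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    if not 0 <= level < (1 << bit_width):
        raise ValueError("level does not fit in bit_width")
    header = encode_varint(count << 1)
    value_bytes = (bit_width + 7) // 8
    return header + level.to_bytes(value_bytes, "little")


def all_null_definition_levels(num_values: int,
                               max_definition_level: int = 1) -> bytes:
    """Definition levels for `num_values` nulls in a flat optional column."""
    if max_definition_level < 1:
        raise ValueError("a required column has no definition levels")
    bit_width = max_definition_level.bit_length()
    return encode_rle_run(0, num_values, bit_width)
```

With these assumptions, the definition levels for a 10-million-row chunk fit in 5 bytes (a 4-byte varint header plus one value byte), which is why a single run per page is attractive compared with one writeNull call per row. Data page v1 additionally prefixes the encoded levels with a 4-byte length, which this sketch leaves out.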


Re: [PR] PARQUET-2409: Add custom .asf.yaml for email notification [parquet-format]

2023-12-05 Thread via GitHub


wgtmac commented on PR #224:
URL: https://github.com/apache/parquet-format/pull/224#issuecomment-1841970937

   Thanks for the suggestion! @kou 





Re: JIRA work log updates

2023-12-05 Thread Gang Wu
Hi Atour,

Recently I tried to migrate notifications to different mailing lists by
adding a customized .asf.yaml file [1]. I may have made the jira_options
settings too verbose. Let me fix this.

[1]
https://github.com/apache/parquet-mr/commit/2d10c282f14b6b34a7f3d0a6bf227b881bef5ad5


Best,
Gang



JIRA work log updates

2023-12-05 Thread Atour Mousavi Gourabi
Hi everyone,

Recently, I've been getting [jira] [Work logged] updates in my inbox after
comments are added on parquet-mr GitHub PRs for which I am watching the JIRA
tickets. I suspect this behaviour was introduced by PARQUET-2407
(apache/parquet-mr#1232). As far as I'm aware, the bot did not previously log
10 minutes of work after each GitHub comment. Do we want to keep sending
emails and logging GitHub comments like this in the future?

All the best,
Atour


Re: [PR] PARQUET-2409: Add custom .asf.yaml for email notification [parquet-format]

2023-12-05 Thread via GitHub


wgtmac commented on PR #224:
URL: https://github.com/apache/parquet-format/pull/224#issuecomment-1841024568

   > @kou Does the file syntax look ok?
   
   I copied this from https://github.com/apache/zookeeper/blob/master/.asf.yaml





Re: [PR] PARQUET-2409: Add custom .asf.yaml for email notification [parquet-format]

2023-12-05 Thread via GitHub


pitrou commented on PR #224:
URL: https://github.com/apache/parquet-format/pull/224#issuecomment-1841021328

   @kou Does the file syntax look ok?





Re: [PR] MINOR: Fix INTERVAL sort order doc in parquet.thrift to be undefined [parquet-format]

2023-12-05 Thread via GitHub


wgtmac merged PR #222:
URL: https://github.com/apache/parquet-format/pull/222





Re: [PR] PARQUET-2409: Add custom .asf.yaml for email notification [parquet-format]

2023-12-05 Thread via GitHub


wgtmac commented on PR #224:
URL: https://github.com/apache/parquet-format/pull/224#issuecomment-1841016437

   PTAL. Thanks! @pitrou @gszadovszky @shangxinli 





[PR] PARQUET-2409: Add custom .asf.yaml for email notification [parquet-format]

2023-12-05 Thread via GitHub


wgtmac opened a new pull request, #224:
URL: https://github.com/apache/parquet-format/pull/224

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-2409
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explain what it does
   





[jira] [Updated] (PARQUET-2398) Make static variables final

2023-12-05 Thread ASF GitHub Bot (Jira)


 [ https://issues.apache.org/jira/browse/PARQUET-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated PARQUET-2398:

Labels: pull-request-available  (was: )

> Make static variables final
> ---
>
> Key: PARQUET-2398
> URL: https://issues.apache.org/jira/browse/PARQUET-2398
> Project: Parquet
> Issue Type: Improvement
> Affects Versions: 1.13.1
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.14.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)