[jira] [Commented] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory

2020-01-13 Thread Jacques Nadeau (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014777#comment-17014777
 ] 

Jacques Nadeau commented on PARQUET-1698:
-

In our internal work we split this responsibility between the IO layer and the 
reader. That lets the IO layer decide how aggressively to read ahead based on 
other attributes, such as currently available memory. I suggest separating it 
that way rather than having the row group reader do this work itself. 
Basically, you want to say: I have several independent streams within the same 
file that I will probably read with these ranges--don't blow the readahead. You 
could then have one implementation that just does full prebuffering and another 
that is more conservative with memory.
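
A minimal sketch of that split, written in Java for illustration only (the issue itself targets parquet-cpp). Every name here (ByteRange, ReadAheadPolicy, FullPrebufferPolicy, BudgetedPolicy) is hypothetical and not part of any Parquet or Arrow API. The reader only hands over the ranges it expects to read; the IO layer picks the policy.

```java
import java.util.ArrayList;
import java.util.List;

/** A byte range [offset, offset + length) that the reader expects to touch. */
final class ByteRange {
    final long offset;
    final long length;

    ByteRange(long offset, long length) {
        this.offset = offset;
        this.length = length;
    }

    long end() {
        return offset + length;
    }
}

/** Hypothetical IO-layer hook: given hinted ranges, decide what to fetch up front. */
interface ReadAheadPolicy {
    List<ByteRange> plan(List<ByteRange> hintedRanges);
}

/** Full prebuffering: one fetch spanning every hinted range. */
final class FullPrebufferPolicy implements ReadAheadPolicy {
    @Override
    public List<ByteRange> plan(List<ByteRange> hintedRanges) {
        List<ByteRange> out = new ArrayList<>();
        if (hintedRanges.isEmpty()) {
            return out;
        }
        long start = Long.MAX_VALUE;
        long end = Long.MIN_VALUE;
        for (ByteRange r : hintedRanges) {
            start = Math.min(start, r.offset);
            end = Math.max(end, r.end());
        }
        out.add(new ByteRange(start, end - start));
        return out;
    }
}

/** Conservative: fetch hinted ranges in offset order until a memory budget is hit. */
final class BudgetedPolicy implements ReadAheadPolicy {
    private final long budgetBytes;

    BudgetedPolicy(long budgetBytes) {
        this.budgetBytes = budgetBytes;
    }

    @Override
    public List<ByteRange> plan(List<ByteRange> hintedRanges) {
        List<ByteRange> sorted = new ArrayList<>(hintedRanges);
        sorted.sort((a, b) -> Long.compare(a.offset, b.offset));
        List<ByteRange> out = new ArrayList<>();
        long used = 0;
        for (ByteRange r : sorted) {
            if (used + r.length > budgetBytes) {
                break; // stop before exceeding the budget; later ranges are read on demand
            }
            out.add(r);
            used += r.length;
        }
        return out;
    }
}
```

With this shape the row group reader stays unaware of memory pressure: it reports ranges, and swapping FullPrebufferPolicy for BudgetedPolicy changes only how much is fetched ahead of time.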

> [C++] Add reader option to pre-buffer entire serialized row group into memory
> -
>
> Key: PARQUET-1698
> URL: https://issues.apache.org/jira/browse/PARQUET-1698
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Zherui Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some scenarios (example: reading datasets from Amazon S3), reading columns 
> independently and allowing unbridled {{Read}} calls to the underlying file 
> handle can yield suboptimal performance. In such cases, it may be preferable 
> to first read the entire serialized row group into memory, then deserialize 
> the constituent columns from this buffer.
> Note that such an option would not be appropriate as a default behavior for 
> all file handle types since low-selectivity reads (example: reading only 3 
> columns out of a file with 100 columns) will be suboptimal in some cases. I 
> think it would be better for "high latency" file systems to opt into this 
> option.
> cc [~fsaintjacques] [~bkietz] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1475) DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor

2018-12-11 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created PARQUET-1475:
---

 Summary: DirectCodecFactory's ParquetCompressionCodecException 
drops a passed in cause in one constructor
 Key: PARQUET-1475
 URL: https://issues.apache.org/jira/browse/PARQUET-1475
 Project: Parquet
  Issue Type: Bug
Reporter: Jacques Nadeau


[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/DirectCodecFactory.java#L521]

 

The cause is not actually passed to super, so it is dropped.
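
To illustrate the general shape of the problem, here is a small sketch. The two classes below are hypothetical stand-ins, not the actual parquet-mr code, and which constructor is affected in DirectCodecFactory is not restated here; see the linked line.

```java
/** Hypothetical buggy exception: accepts a cause but never forwards it. */
class DroppingCauseException extends RuntimeException {
    DroppingCauseException(String message, Throwable cause) {
        super(message); // cause is silently lost; getCause() will return null
    }
}

/** Hypothetical fixed exception: forwards the cause to the superclass. */
class KeepingCauseException extends RuntimeException {
    KeepingCauseException(String message, Throwable cause) {
        super(message, cause); // cause is preserved in the exception chain
    }
}
```

With the first form, the stack trace of the wrapped exception never shows up in logs, which makes the original failure hard to diagnose.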





[jira] [Commented] (PARQUET-1154) [C++] Add function to concatenate a collection of Parquet files to create a new single file

2017-11-04 Thread Jacques Nadeau (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239319#comment-16239319
 ] 

Jacques Nadeau commented on PARQUET-1154:
-

As an aside, it would be really nice if this were a streaming operation that 
required minimal memory buffering (by doing it column-at-a-time).

> [C++] Add function to concatenate a collection of Parquet files to create a 
> new single file
> ---
>
> Key: PARQUET-1154
> URL: https://issues.apache.org/jira/browse/PARQUET-1154
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>






[jira] [Created] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't

2017-06-09 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created PARQUET-1028:
---

 Summary: [JAVA] When reading old Spark-generated files with INT96, 
stats are reported as valid when they aren't 
 Key: PARQUET-1028
 URL: https://issues.apache.org/jira/browse/PARQUET-1028
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.9.0
Reporter: Jacques Nadeau
 Fix For: 1.9.1


Found that the condition 
[here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55]
 is missing a check for INT96. Since INT96 stats are also corrupt with old 
versions of Parquet, the code here shouldn't short-circuit and return for INT96.
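
A rough sketch of the intended condition follows. The enum, class, and method names are hypothetical simplifications and not the parquet-mr API; the writer-version parsing is reduced to a boolean that the caller is assumed to compute.

```java
/** Simplified stand-in for Parquet's physical types. */
enum PhysicalType { BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY, FIXED_LEN_BYTE_ARRAY }

final class StatsCheckSketch {
    /** Types whose statistics may be corrupt in files from old writers. */
    static boolean needsVersionCheck(PhysicalType type) {
        return type == PhysicalType.BINARY
            || type == PhysicalType.FIXED_LEN_BYTE_ARRAY
            || type == PhysicalType.INT96; // the check this issue says is missing
    }

    /**
     * Returns true when stats should be ignored.
     * writtenByAffectedVersion is assumed to be derived elsewhere from the file's created_by string.
     */
    static boolean shouldIgnoreStatistics(boolean writtenByAffectedVersion, PhysicalType type) {
        if (!needsVersionCheck(type)) {
            return false; // short-circuit only for types that were never affected
        }
        return writtenByAffectedVersion;
    }
}
```

Without INT96 in the first check, the method would return false for INT96 columns regardless of the writer version, which is the behavior reported above.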





[jira] [Commented] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder

2015-12-15 Thread Jacques Nadeau (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059056#comment-15059056
 ] 

Jacques Nadeau commented on PARQUET-369:


+1 for Ryan's suggestion. Not sure how many Java users exist that depend on the 
format but not MR (are there any?). Might as well use slf4j for both.

> Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
> ---
>
> Key: PARQUET-369
> URL: https://issues.apache.org/jira/browse/PARQUET-369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Ryan Blue
> Fix For: format-2.3.1
>
>
> Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see 
> [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]).
>  This also accidentally shades [this 
> line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207]
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> to
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "parquet/org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} 
> implementation even if we provide dependencies like {{slf4j-log4j12}} on the 
> classpath.
> This happens in Spark. Whenever we write a Parquet file, we see the following 
> famous message and can never get rid of it:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> {noformat}
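
The rewrite described in the quoted report can be mimicked with a toy relocation function. This is only a sketch of the effect, not the shading tool's real implementation; RelocationSketch and relocate are hypothetical names.

```java
final class RelocationSketch {
    /**
     * Rewrites a string constant the way a package relocation rule might:
     * both dotted class names and slash-separated resource paths get the shaded prefix.
     */
    static String relocate(String constant, String pkg, String shadedPkg) {
        String path = pkg.replace('.', '/');
        String shadedPath = shadedPkg.replace('.', '/');
        if (constant.startsWith(path + "/")) {
            return shadedPath + constant.substring(path.length());
        }
        if (constant.startsWith(pkg + ".")) {
            return shadedPkg + constant.substring(pkg.length());
        }
        return constant; // unrelated constants are left alone
    }
}
```

Because the binder lookup path is itself a string constant under the relocated package, it is rewritten along with everything else, so an unshaded binding on the classpath no longer matches the path being searched.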


