[jira] [Commented] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory
[ https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014777#comment-17014777 ]

Jacques Nadeau commented on PARQUET-1698:
-----------------------------------------

In our internal work we split this responsibility between the IO layer and the reader. This allows the IO layer to decide how aggressively it reads ahead based on other attributes such as current memory availability. I suggest separating it out that way, as opposed to having the row group reader do this work itself. Basically, you want to say: I have several independent streams within the same file that I will probably read with these ranges; don't blow the readahead. You could then have one implementation that just does full prebuffering and another that is more conservative with memory.

> [C++] Add reader option to pre-buffer entire serialized row group into memory
> -----------------------------------------------------------------------------
>
> Key: PARQUET-1698
> URL: https://issues.apache.org/jira/browse/PARQUET-1698
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Zherui Cao
> Priority: Major
> Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In some scenarios (example: reading datasets from Amazon S3), reading columns
> independently and allowing unbridled {{Read}} calls to the underlying file
> handle can yield suboptimal performance. In such cases, it may be preferable
> to first read the entire serialized row group into memory then deserialize
> the constituent columns from this
> Note that such an option would not be appropriate as a default behavior for
> all file handle types since low-selectivity reads (example: reading only 3
> columns out of a file with 100 columns) will be suboptimal in some cases. I
> think it would be better for "high latency" file systems to opt into this
> option
> cc [~fsaintjacques] [~bkietz] [~apitrou]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
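The split described in the comment above (the reader announces the byte ranges it expects to read, and the IO layer decides how much to buffer) can be sketched roughly as follows. This is only an illustration in Java: every name here (`ByteRange`, `ReadAheadPolicy`, `FullPrebuffer`, `BudgetedPrebuffer`) is hypothetical and is not part of the Parquet or Arrow APIs, and the budget rule is an assumed stand-in for "more conservative with memory".

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical; this is not the Parquet or Arrow API.

/** A byte range [offset, offset + length) that the reader expects to read. */
final class ByteRange {
    final long offset;
    final long length;

    ByteRange(long offset, long length) {
        this.offset = offset;
        this.length = length;
    }
}

/** The IO layer, not the row group reader, decides what to fetch ahead of time. */
interface ReadAheadPolicy {
    List<ByteRange> plan(List<ByteRange> hinted, long memoryBudgetBytes);
}

/** Buffers every hinted range and ignores the budget (full pre-buffering). */
final class FullPrebuffer implements ReadAheadPolicy {
    @Override
    public List<ByteRange> plan(List<ByteRange> hinted, long memoryBudgetBytes) {
        return new ArrayList<>(hinted);
    }
}

/** Buffers hinted ranges in order and stops once the memory budget is used up. */
final class BudgetedPrebuffer implements ReadAheadPolicy {
    @Override
    public List<ByteRange> plan(List<ByteRange> hinted, long memoryBudgetBytes) {
        List<ByteRange> planned = new ArrayList<>();
        long remaining = memoryBudgetBytes;
        for (ByteRange range : hinted) {
            if (remaining <= 0) {
                break;
            }
            // Truncate the last range so the plan never exceeds the budget.
            long take = Math.min(range.length, remaining);
            planned.add(new ByteRange(range.offset, take));
            remaining -= take;
        }
        return planned;
    }
}
```

With this shape the reader only supplies hints, so a high-latency file system could plug in `FullPrebuffer` while a memory-constrained one uses `BudgetedPrebuffer`, without any change to the reader itself.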
[jira] [Created] (PARQUET-1475) DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor
Jacques Nadeau created PARQUET-1475:
---------------------------------------

Summary: DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor
Key: PARQUET-1475
URL: https://issues.apache.org/jira/browse/PARQUET-1475
Project: Parquet
Issue Type: Bug
Reporter: Jacques Nadeau

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/DirectCodecFactory.java#L521]

Cause is not actually passed to super.
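For readers unfamiliar with this class of bug, a minimal sketch of the pattern follows. The two exception classes are hypothetical stand-ins written for illustration; they are not the parquet-mr source, and the actual constructor at the linked line may differ in detail.

```java
// Hypothetical stand-ins for illustration; not the actual parquet-mr classes.

/** Bug pattern: the constructor accepts a cause but never forwards it. */
class CauseDroppingException extends RuntimeException {
    CauseDroppingException(String message, Throwable cause) {
        super(message); // cause is silently discarded here
    }
}

/** Fixed pattern: the cause is handed to the superclass constructor. */
class CausePreservingException extends RuntimeException {
    CausePreservingException(String message, Throwable cause) {
        super(message, cause);
    }
}
```

In the first variant `getCause()` returns `null`, so the original stack trace never appears in logs; in the second the underlying failure remains attached to the exception.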
[jira] [Commented] (PARQUET-1154) [C++] Add function to concatenate a collection of Parquet files to create a new single file
[ https://issues.apache.org/jira/browse/PARQUET-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239319#comment-16239319 ]

Jacques Nadeau commented on PARQUET-1154:
-----------------------------------------

As an aside, it would be really nice if this were a streaming operation that required minimal memory buffering (by doing it column-at-a-time).

> [C++] Add function to concatenate a collection of Parquet files to create a
> new single file
> ---------------------------------------------------------------------------
>
> Key: PARQUET-1154
> URL: https://issues.apache.org/jira/browse/PARQUET-1154
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp
> Reporter: Wes McKinney
>
[jira] [Created] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't
Jacques Nadeau created PARQUET-1028:
---------------------------------------

Summary: [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't
Key: PARQUET-1028
URL: https://issues.apache.org/jira/browse/PARQUET-1028
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.9.0
Reporter: Jacques Nadeau
Fix For: 1.9.1

Found that the condition [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55] is missing a check for INT96. Since INT96 stats are also corrupt with old versions of Parquet, the code here shouldn't short-circuit return.
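The kind of change being asked for can be sketched as below. This is a simplified illustration only: the enum and both methods are hypothetical, and the "before" condition is an assumption about the shape of the early return (a type-based exit that lets only binary-like columns reach the writer-version check), not a copy of the linked code.

```java
// Hypothetical sketch; the enum and method names are not the parquet-mr API.
enum PhysicalType { BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY, FIXED_LEN_BYTE_ARRAY }

final class StatsTypeGate {
    private StatsTypeGate() {}

    /** Assumed current behavior: every non-binary type exits early and is trusted. */
    static boolean skipsVersionCheckBefore(PhysicalType type) {
        return type != PhysicalType.BINARY
            && type != PhysicalType.FIXED_LEN_BYTE_ARRAY;
    }

    /** Proposed behavior: INT96 no longer short-circuits, so old writers get checked. */
    static boolean skipsVersionCheckAfter(PhysicalType type) {
        return skipsVersionCheckBefore(type) && type != PhysicalType.INT96;
    }
}
```

Under this assumption, INT96 columns written by old versions would fall through to the same writer-version test that already guards binary columns, instead of having their stats reported as valid.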
[jira] [Commented] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
[ https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059056#comment-15059056 ]

Jacques Nadeau commented on PARQUET-369:
----------------------------------------

+1 for Ryan's suggestion. Not sure how many Java users exist that depend on the format but not MR (are there any?). Might as well use slf4j for both.

> Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
> -----------------------------------------------------------------------
>
> Key: PARQUET-369
> URL: https://issues.apache.org/jira/browse/PARQUET-369
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Cheng Lian
> Assignee: Ryan Blue
> Fix For: format-2.3.1
>
>
> Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see
> [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]).
> This also accidentally shades [this
> line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207]
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = "org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> to
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = "parquet/org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}}
> implementation even if we provide dependencies like {{slf4j-log4j12}} on the
> classpath.
> This happens in Spark. Whenever we write a Parquet file, we see the following
> famous message and can never get rid of it:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> {noformat}
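The failure mode in the quoted description can be reproduced in miniature. The sketch below is a simulation under stated assumptions: `relocate` is a hypothetical helper that mimics a prefix-based rewrite of a path-like string constant, and the classpath is modeled as a plain set of resource names. It does not call the shade plugin or SLF4J.

```java
import java.util.Set;

// Hypothetical simulation; this does not invoke the shade plugin or SLF4J.
final class ShadingDemo {
    private ShadingDemo() {}

    /** Mimics a relocation that rewrites a path-like constant by prefix. */
    static String relocate(String constant, String fromPrefix, String toPrefix) {
        return constant.startsWith(fromPrefix)
            ? toPrefix + constant.substring(fromPrefix.length())
            : constant;
    }

    /** Models the binder lookup as a resource-name match on the classpath. */
    static boolean binderFound(Set<String> classpathResources, String binderPath) {
        return classpathResources.contains(binderPath);
    }
}
```

A binding jar such as slf4j-log4j12 ships the binder under the unshaded path, so once the lookup constant has been rewritten to the `parquet/` prefix the match fails and the NOP logger is used.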