[jira] [Resolved] (PARQUET-852) Slowly ramp up sizes of byte[] in ByteBasedBitPackingEncoder
[ https://issues.apache.org/jira/browse/PARQUET-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem resolved PARQUET-852.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.10.0

Issue resolved by pull request 401
[https://github.com/apache/parquet-mr/pull/401]

> Slowly ramp up sizes of byte[] in ByteBasedBitPackingEncoder
> ------------------------------------------------------------
>
>                 Key: PARQUET-852
>                 URL: https://issues.apache.org/jira/browse/PARQUET-852
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: John Jenkins
>            Priority: Minor
>             Fix For: 1.10.0
>
>
> The current allocation policy for ByteBasedBitPackingEncoder is to allocate
> 64KB * #bits up-front. As similarly observed in [PARQUET-580], this can lead
> to significant memory overheads for high-fanout scenarios (many columns
> and/or open files, in my case using BooleanPlainValuesWriter).
> As done in [PARQUET-585], I'll follow up with a PR that starts with a smaller
> buffer and works its way up to a max.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
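For readers unfamiliar with the buffer strategy being proposed, the Java sketch below illustrates the general ramp-up idea; the class name and size constants are assumptions for illustration, not the actual change merged in the pull request. The point is that a writer which only ever stores a few values never pays for the full 64KB * #bits allocation up front.

{noformat}
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the ramp-up idea: start with a small slab
// and double the slab size on each new allocation until a cap is reached,
// instead of allocating the maximum up front.
class RampingSlabAllocator {
  private static final int INITIAL_SLAB_SIZE = 1024;   // assumed starting size
  private static final int MAX_SLAB_SIZE = 64 * 1024;  // assumed cap

  private final List<byte[]> slabs = new ArrayList<>();
  private int nextSlabSize = INITIAL_SLAB_SIZE;

  /** Allocates the next slab, doubling its size until the cap is reached. */
  byte[] nextSlab() {
    byte[] slab = new byte[nextSlabSize];
    slabs.add(slab);
    nextSlabSize = Math.min(nextSlabSize * 2, MAX_SLAB_SIZE);
    return slab;
  }

  /** Total bytes held so far; small writers stay near INITIAL_SLAB_SIZE. */
  long allocatedBytes() {
    long total = 0;
    for (byte[] slab : slabs) {
      total += slab.length;
    }
    return total;
  }
}
{noformat}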
[jira] [Resolved] (PARQUET-196) parquet-tools command to get rowcount & size
[ https://issues.apache.org/jira/browse/PARQUET-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem resolved PARQUET-196.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: (was: 1.6.0)
                   1.10.0

Issue resolved by pull request 406
[https://github.com/apache/parquet-mr/pull/406]

> parquet-tools command to get rowcount & size
> --------------------------------------------
>
>                 Key: PARQUET-196
>                 URL: https://issues.apache.org/jira/browse/PARQUET-196
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Swapnil
>            Priority: Minor
>              Labels: features
>             Fix For: 1.10.0
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> Parquet files contain metadata about the row count & file size. We should
> have new commands to get the row count & size. These commands can be added
> to parquet-tools:
> 1. rowcount: adds up the number of rows in all footers to give the total
>    number of rows in the data.
> 2. size: gives the compressed size in bytes, in a human-readable format too.
> These commands help us avoid parsing job logs or loading the data again just
> to find the number of rows. This comes in very handy in complex processes,
> stats generation, QA, etc.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
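Both numbers are already present in the footer metadata, so a metadata-only read suffices. A minimal sketch in Java against the parquet-mr footer API of the 1.8/1.9 line (readFooter was later deprecated); this handles a single file for simplicity, whereas the actual tool would aggregate across all footers of a dataset:

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class FooterStats {
  public static void main(String[] args) throws Exception {
    // Read only the footer; no row data is deserialized.
    ParquetMetadata footer =
        ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));
    long rows = 0;
    long compressedBytes = 0;
    for (BlockMetaData block : footer.getBlocks()) {
      rows += block.getRowCount();                  // per-row-group row count
      compressedBytes += block.getCompressedSize(); // per-row-group compressed size
    }
    System.out.println("rows: " + rows);
    System.out.println("compressed bytes: " + compressedBytes);
  }
}
{noformat}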
[jira] [Resolved] (PARQUET-969) Decimal datatype support for parquet-tools output
[ https://issues.apache.org/jira/browse/PARQUET-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem resolved PARQUET-969.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.10.0

Issue resolved by pull request 412
[https://github.com/apache/parquet-mr/pull/412]

> Decimal datatype support for parquet-tools output
> -------------------------------------------------
>
>                 Key: PARQUET-969
>                 URL: https://issues.apache.org/jira/browse/PARQUET-969
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Dan Fowler
>            Priority: Minor
>             Fix For: 1.10.0
>
>
> parquet-tools cat outputs decimal datatypes in binary/bytearray format. I
> would like decimal datatypes converted to their actual number representation,
> so that when parquet data is output from parquet-tools, decimals will be
> numbers.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
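The underlying conversion is mechanical once the column's scale is known: Parquet stores DECIMAL as a big-endian two's-complement unscaled value, which maps directly onto java.math.BigInteger and BigDecimal. A minimal sketch (the helper and class names are hypothetical, not the parquet-tools implementation):

{noformat}
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalBytes {
  /**
   * Converts the raw bytes of a Parquet DECIMAL (big-endian two's-complement
   * unscaled value, as stored in BINARY or FIXED_LEN_BYTE_ARRAY) into a
   * BigDecimal, using the scale from the column's schema.
   */
  static BigDecimal fromUnscaledBytes(byte[] unscaled, int scale) {
    return new BigDecimal(new BigInteger(unscaled), scale);
  }

  public static void main(String[] args) {
    // 0x04D2 is 1234 unscaled; with scale 2 this prints 12.34
    System.out.println(fromUnscaledBytes(new byte[] {0x04, (byte) 0xD2}, 2));
  }
}
{noformat}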
[jira] [Updated] (PARQUET-980) Cannot read row group larger than 2GB
[ https://issues.apache.org/jira/browse/PARQUET-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell updated PARQUET-980:
--------------------------------------
    Description:

Parquet MR 1.8.2 does not support reading row groups which are larger than 2 GB. See:
https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064

We are seeing this when writing skewed records. This throws off the estimation of
the memory check interval in the InternalParquetRecordWriter. The following Spark
code illustrates this:

{noformat}
/**
 * Create a data frame that will make parquet write a file with a row group larger
 * than 2 GB. Parquet only checks the size of the row group after writing a number
 * of records. This number is based on the average row size of the already written
 * records. This is problematic in the following scenario:
 * - The initial (100) records in the row group are relatively small.
 * - The InternalParquetRecordWriter checks if it needs to write to disk (it should
 *   not); it assumes that the remaining records have a similar size, and (greatly)
 *   increases the check interval (usually to 10000).
 * - The remaining records are much larger than expected, making the row group
 *   larger than 2 GB (which makes reading the row group impossible).
 *
 * The data frame below illustrates such a scenario. This creates a row group of
 * approximately 4GB.
 */
val badDf = spark.range(0, 2200, 1, 1).mapPartitions { iterator =>
  var i = 0
  val random = new scala.util.Random(42)
  val buffer = new Array[Char](750000)
  iterator.map { id =>
    // the first 200 records have a length of 1K and the remaining 2000 have a length of 750K.
    val numChars = if (i < 200) 1000 else 750000
    i += 1
    // fill the buffer with random chars
    var j = 0
    while (j < numChars) {
      // Generate a char (borrowed from scala.util.Random)
      buffer(j) = (random.nextInt(0xD800 - 1) + 1).toChar
      j += 1
    }
    // create a string: the string constructor will copy the buffer.
    new String(buffer, 0, numChars)
  }
}
badDf.write.parquet("somefile")

val corruptedDf = spark.read.parquet("somefile")
corruptedDf.select(count(lit(1)), max(length($"value"))).show()
{noformat}

The latter fails with the following exception:

{noformat}
java.lang.NegativeArraySizeException
	at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1064)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:698)
	...
{noformat}

-This seems to be fixed by commit
https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
in parquet 1.9.x. Is there any chance that we can fix this in 1.8.x?-
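The NegativeArraySizeException above is the classic signature of a long byte count being narrowed to an int before an array allocation. A minimal, self-contained Java illustration of that failure mode follows; the class name and chunk sizes are invented for the example, and this is not the parquet-mr code itself:

{noformat}
public class RowGroupOverflow {
  public static void main(String[] args) {
    // Two chunks of ~1.5 GB each, i.e. a ~3 GB row group.
    long[] chunkSizes = {1_500_000_000L, 1_500_000_000L};
    long total = 0;
    for (long size : chunkSizes) {
      total += size;
    }
    int truncated = (int) total; // 3,000,000,000 does not fit in an int
    System.out.println("total=" + total + ", truncated=" + truncated);
    byte[] buffer = new byte[truncated]; // throws NegativeArraySizeException
  }
}
{noformat}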
[jira] [Resolved] (PARQUET-930) [C++] Account for all Arrow date/time types
[ https://issues.apache.org/jira/browse/PARQUET-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn resolved PARQUET-930.
---------------------------------
    Resolution: Fixed

Issue resolved by pull request 321
[https://github.com/apache/parquet-cpp/pull/321]

> [C++] Account for all Arrow date/time types
> -------------------------------------------
>
>                 Key: PARQUET-930
>                 URL: https://issues.apache.org/jira/browse/PARQUET-930
>             Project: Parquet
>          Issue Type: New Feature
>            Reporter: Wes McKinney
>            Assignee: Uwe L. Korn
>             Fix For: cpp-1.1.0
>
>
> Arrow 0.3 has some additional date/time array types and metadata that we
> need to support more completely.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Resolved] (PARQUET-977) Improve MSVC build
[ https://issues.apache.org/jira/browse/PARQUET-977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn resolved PARQUET-977.
---------------------------------
       Resolution: Fixed
    Fix Version/s: cpp-1.1.0

Issue resolved by pull request 320
[https://github.com/apache/parquet-cpp/pull/320]

> Improve MSVC build
> ------------------
>
>                 Key: PARQUET-977
>                 URL: https://issues.apache.org/jira/browse/PARQUET-977
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>         Environment: windows / msvc
>            Reporter: rip.nsk
>             Fix For: cpp-1.1.0
>
>
> I'm going to improve and clean up the MSVC build of parquet-cpp.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)