[jira] [Resolved] (PARQUET-852) Slowly ramp up sizes of byte[] in ByteBasedBitPackingEncoder

2017-05-12 Thread Julien Le Dem (JIRA)

 [ https://issues.apache.org/jira/browse/PARQUET-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem resolved PARQUET-852.
---
   Resolution: Fixed
Fix Version/s: 1.10.0

Issue resolved by pull request 401
[https://github.com/apache/parquet-mr/pull/401]

> Slowly ramp up sizes of byte[] in ByteBasedBitPackingEncoder
> 
>
> Key: PARQUET-852
> URL: https://issues.apache.org/jira/browse/PARQUET-852
> Project: Parquet
>  Issue Type: Improvement
>Reporter: John Jenkins
>Priority: Minor
> Fix For: 1.10.0
>
>
> The current allocation policy for ByteBasedBitPackingEncoder is to allocate 
> 64KB * #bits up-front. As similarly observed in [PARQUET-580], this can lead 
> to significant memory overheads for high-fanout scenarios (many columns 
> and/or open files, in my case using BooleanPlainValuesWriter).
> As done in [PARQUET-585], I'll follow up with a PR that starts with a smaller 
> buffer and works its way up to a max.
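
The ramp-up idea can be pictured as follows: instead of allocating the full 64KB * #bits buffer up front, start with a small slab and grow the slab size geometrically toward a cap as more values arrive. The Java sketch below only illustrates that growth policy under assumed names and sizes (RampUpSlabAllocator, a 1 KB initial slab, a 64 KB cap); it is not the actual change merged in pull request 401.

{noformat}
// Illustrative only: a slab allocator that starts small and doubles toward a cap,
// instead of allocating the maximum buffer size up front.
import java.util.ArrayList;
import java.util.List;

public class RampUpSlabAllocator {
    private final int maxSlabSize;        // e.g. 64 KB * bit width
    private final List<byte[]> slabs = new ArrayList<>();
    private int nextSlabSize;             // starts small, e.g. 1 KB

    public RampUpSlabAllocator(int initialSlabSize, int maxSlabSize) {
        this.nextSlabSize = initialSlabSize;
        this.maxSlabSize = maxSlabSize;
    }

    /** Allocates the next slab, doubling the size until the cap is reached. */
    public byte[] newSlab() {
        byte[] slab = new byte[nextSlabSize];
        slabs.add(slab);
        nextSlabSize = Math.min(nextSlabSize * 2, maxSlabSize);
        return slab;
    }

    public static void main(String[] args) {
        RampUpSlabAllocator allocator = new RampUpSlabAllocator(1024, 64 * 1024);
        for (int i = 0; i < 8; i++) {
            System.out.println("slab " + i + ": " + allocator.newSlab().length + " bytes");
        }
    }
}
{noformat}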



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-196) parquet-tools command to get rowcount & size

2017-05-12 Thread Julien Le Dem (JIRA)

 [ https://issues.apache.org/jira/browse/PARQUET-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem resolved PARQUET-196.
---
   Resolution: Fixed
Fix Version/s: (was: 1.6.0)
   1.10.0

Issue resolved by pull request 406
[https://github.com/apache/parquet-mr/pull/406]

> parquet-tools command to get rowcount & size
> 
>
> Key: PARQUET-196
> URL: https://issues.apache.org/jira/browse/PARQUET-196
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.6.0
>Reporter: Swapnil
>Priority: Minor
>  Labels: features
> Fix For: 1.10.0
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> Parquet files contain metadata about row count & file size. We should have new 
> commands to get the row count & size.
> These commands can be added to parquet-tools:
> 1. rowcount: adds up the row counts from all footers to give the total number 
> of rows in the data.
> 2. size: reports the compressed size in bytes, and in a human-readable format 
> as well.
> These commands help us avoid parsing job logs or loading the data again just 
> to find the number of rows. They come in very handy in complex processes, 
> stats generation, QA, etc.
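
Both numbers are already recorded in the file footer, so the proposed commands mostly need to read the footer and aggregate across row groups. Below is a minimal sketch of that aggregation against the parquet-mr footer API (ParquetFileReader.readFooter, BlockMetaData.getRowCount, BlockMetaData.getCompressedSize); the command-line wiring in the actual parquet-tools change from pull request 406 may look different.

{noformat}
// Minimal sketch: total row count and compressed size from one Parquet file's footer.
// Assumes parquet-mr and Hadoop are on the classpath; error handling is omitted.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class FooterStats {
    public static void main(String[] args) throws Exception {
        ParquetMetadata footer =
            ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));

        long rowCount = 0;
        long compressedSize = 0;
        for (BlockMetaData block : footer.getBlocks()) {
            rowCount += block.getRowCount();             // rows in this row group
            compressedSize += block.getCompressedSize(); // compressed bytes in this row group
        }
        System.out.println("rows: " + rowCount);
        System.out.println("compressed size (bytes): " + compressedSize);
    }
}
{noformat}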



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-969) Decimal datatype support for parquet-tools output

2017-05-12 Thread Julien Le Dem (JIRA)

 [ https://issues.apache.org/jira/browse/PARQUET-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem resolved PARQUET-969.
---
   Resolution: Fixed
Fix Version/s: 1.10.0

Issue resolved by pull request 412
[https://github.com/apache/parquet-mr/pull/412]

> Decimal datatype support for parquet-tools output
> -
>
> Key: PARQUET-969
> URL: https://issues.apache.org/jira/browse/PARQUET-969
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Dan Fowler
>Priority: Minor
> Fix For: 1.10.0
>
>
> parquet-tools cat outputs decimal datatypes in binary/byte-array format. I 
> would like the decimal datatypes converted to their actual numeric 
> representation, so that decimals appear as numbers when parquet data is 
> output from parquet-tools.
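
For reference, Parquet stores a DECIMAL as the two's-complement bytes of the unscaled value plus precision/scale in the schema, so printing it as a number is a matter of reapplying the scale. A minimal sketch of that conversion in Java is below; the way pull request 412 hooks this into the parquet-tools output path may differ.

{noformat}
// Turning a Parquet DECIMAL's unscaled bytes into a readable number. Illustrative only;
// the scale would come from the column's decimal metadata in the file schema.
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalRendering {
    /** Reinterprets two's-complement unscaled bytes with the given scale. */
    static BigDecimal fromUnscaledBytes(byte[] unscaled, int scale) {
        return new BigDecimal(new BigInteger(unscaled), scale);
    }

    public static void main(String[] args) {
        // 123456 with scale 2 represents 1234.56
        byte[] unscaled = BigInteger.valueOf(123456).toByteArray();
        System.out.println(fromUnscaledBytes(unscaled, 2)); // prints 1234.56
    }
}
{noformat}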



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PARQUET-980) Cannot read row group larger than 2GB

2017-05-12 Thread Herman van Hovell (JIRA)

 [ https://issues.apache.org/jira/browse/PARQUET-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell updated PARQUET-980:
--
Description: 
Parquet MR 1.8.2 does not support reading row groups which are larger than 2 
GB. See: 
https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064

We are seeing this when writing skewed records. The skew throws off the 
estimation of the memory check interval in the InternalParquetRecordWriter. 
The following Spark code illustrates this:
{noformat}
/**
 * Create a data frame that will make parquet write a file with a row group larger
 * than 2 GB. Parquet only checks the size of the row group after writing a number
 * of records. This number is based on the average row size of the already written
 * records. This is problematic in the following scenario:
 * - The initial (100) records in the record group are relatively small.
 * - The InternalParquetRecordWriter checks if it needs to write to disk (it should
 *   not); it assumes that the remaining records have a similar size, and (greatly)
 *   increases the check interval (usually to 1).
 * - The remaining records are much larger than expected, making the row group
 *   larger than 2 GB (which makes reading the row group impossible).
 *
 * The data frame below illustrates such a scenario. This creates a row group of
 * approximately 4GB.
 */
val badDf = spark.range(0, 2200, 1, 1).mapPartitions { iterator =>
  var i = 0
  val random = new scala.util.Random(42)
  val buffer = new Array[Char](750000)
  iterator.map { id =>
    // The first 200 records have a length of 1K and the remaining 2000 have a
    // length of 750K.
    val numChars = if (i < 200) 1000 else 750000
    i += 1

    // Fill the buffer with random characters.
    var j = 0
    while (j < numChars) {
      // Generate a char (borrowed from scala.util.Random).
      buffer(j) = (random.nextInt(0xD800 - 1) + 1).toChar
      j += 1
    }

    // Create a string: the string constructor will copy the buffer.
    new String(buffer, 0, numChars)
  }
}
badDf.write.parquet("somefile")
val corruptedDf = spark.read.parquet("somefile")
corruptedDf.select(count(lit(1)), max(length($"value"))).show()
{noformat}
The latter fails with the following exception:
{noformat}
java.lang.NegativeArraySizeException
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1064)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:698)
...
{noformat}
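
Roughly speaking, the 2 GB ceiling exists because Java arrays are indexed by int: when the reader sizes one byte[] for a consecutive chunk list, a length above Integer.MAX_VALUE wraps to a negative int and the allocation throws NegativeArraySizeException. The tiny demo below illustrates the overflow; it is not parquet-mr code.

{noformat}
// A long length above Integer.MAX_VALUE wraps to a negative int when cast,
// and allocating an array with that size throws. Illustration only.
public class NegativeSizeDemo {
    public static void main(String[] args) {
        long chunkLength = 3L * 1024 * 1024 * 1024;  // ~3 GB of chunk data
        int truncated = (int) chunkLength;           // wraps to -1073741824
        System.out.println("truncated length = " + truncated);
        byte[] buffer = new byte[truncated];         // throws NegativeArraySizeException
        System.out.println(buffer.length);           // never reached
    }
}
{noformat}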

-This seems to be fixed by commit https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8 in parquet 1.9.x. Is there any chance that we can fix this in 1.8.x?-



[jira] [Resolved] (PARQUET-930) [C++] Account for all Arrow date/time types

2017-05-12 Thread Uwe L. Korn (JIRA)

 [ https://issues.apache.org/jira/browse/PARQUET-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn resolved PARQUET-930.
-
Resolution: Fixed

Issue resolved by pull request 321
[https://github.com/apache/parquet-cpp/pull/321]

> [C++] Account for all Arrow date/time types 
> 
>
> Key: PARQUET-930
> URL: https://issues.apache.org/jira/browse/PARQUET-930
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
> Fix For: cpp-1.1.0
>
>
> Arrow 0.3 has some additional date / time array types and metadata that we 
> need to support more completely.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-977) Improve MSVC build

2017-05-12 Thread Uwe L. Korn (JIRA)

 [ https://issues.apache.org/jira/browse/PARQUET-977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn resolved PARQUET-977.
-
   Resolution: Fixed
Fix Version/s: cpp-1.1.0

Issue resolved by pull request 320
[https://github.com/apache/parquet-cpp/pull/320]

> Improve MSVC build
> --
>
> Key: PARQUET-977
> URL: https://issues.apache.org/jira/browse/PARQUET-977
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
> Environment: windows / msvc
>Reporter: rip.nsk
> Fix For: cpp-1.1.0
>
>
> I'm going to improve and clean up the MSVC build of parquet-cpp.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)