[jira] [Commented] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2022-02-17 Thread Ranga Reddy (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493882#comment-17493882 ]

Ranga Reddy commented on PARQUET-1632:
--

Calling *repartition()* before writing the data avoids the issue:
{code:java}
df.repartition(200).write.option("compression", "uncompressed").mode("overwrite").parquet("/tmp/large.parquet")
{code}

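Why the workaround helps (rough arithmetic, assuming the 40-row reproduction quoted below): with 200 partitions, each task buffers at most one of the 64 MiB values, far below the ~2 GB point at which the writer's signed 32-bit counters overflow.
{code:java}
val rows = 40L
val partitions = 200L
val valueSize = 64L * 1024 * 1024        // 64 MiB per value, as in the repro
val perTask = math.ceil(rows.toDouble / partitions).toLong * valueSize
println(perTask)                         // 67108864 bytes (~64 MiB) per writer
println(perTask < Int.MaxValue)          // true: no 32-bit counter overflow
{code}
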
> Negative initial size when writing large values in parquet-mr
> -------------------------------------------------------------
>
> Key: PARQUET-1632
> URL: https://issues.apache.org/jira/browse/PARQUET-1632
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.10.1
> Reporter: Ivan Sadikov
> Assignee: Junjie Chen
> Priority: Major
>
> I encountered an issue when writing large string values to Parquet.
> Here is the code to reproduce the issue:
> {code:java}
> import org.apache.spark.sql.functions._
> def longString: String = "a" * (64 * 1024 * 1024)
> val long_string = udf(() => longString)
> val df = spark.range(0, 40, 1, 1).withColumn("large_str", long_string())
> spark.conf.set("parquet.enable.dictionary", "false")
> df.write.option("compression", "uncompressed").mode("overwrite").parquet("/tmp/large.parquet"){code}
>  
> This Spark job fails with the exception:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10861.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10861.0 (TID 671168, 10.0.180.13, executor 14656): org.apache.spark.SparkException: Task failed while writing rows.
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
>   at org.apache.spark.scheduler.Task.run(Task.scala:112)
>   at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: Negative initial size: -1610612543
>   at java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:74)
>   at org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:234)
>   at org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:232)
>   at org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:202)
>   at org.apache.parquet.bytes.ConcatenatingByteArrayCollector.collect(ConcatenatingByteArrayCollector.java:33)
>   at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:126)
>   at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
>   at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235)
>   at org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
>   at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:172)
>   at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
>   at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
>   at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
>   at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
>   at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1560)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
>   ... 11 more{code}
>  
> Would appreciate if you could help addressing the problem. Thanks!

[jira] [Commented] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-09 Thread Ivan Sadikov (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903899#comment-16903899 ]

Ivan Sadikov commented on PARQUET-1632:
---

Actually, setting the page.size configuration to smaller values results in PARQUET-1633, where one cannot read back the file one has just written.

Yes, I see what you mean: arrayOutput overflowed, so the reported size became negative, passed the maximum-page-size check in the writePage method, and instead failed when running buf.collect.
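
For illustration only (this is not parquet-mr code and the byte counts are made up), a minimal sketch of that failure mode: a buffered size beyond Int.MaxValue wraps to a negative int, slips past a "larger than max page size" comparison, and then fails when used as the initial size of a ByteArrayOutputStream.
{code:java}
val bufferedBytes: Long = 40L * 64 * 1024 * 1024 + 65   // ~2.5 GB of raw values (illustrative)
val narrowed: Int = bufferedBytes.toInt                  // wraps around to -1610612671
val maxPageSize = 1 << 20                                // hypothetical 1 MiB page limit
println(narrowed > maxPageSize)                          // false -> the "too large" check passes
new java.io.ByteArrayOutputStream(narrowed)              // IllegalArgumentException: Negative initial size
{code}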


[jira] [Commented] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-09 Thread Junjie Chen (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903870#comment-16903870 ]

Junjie Chen commented on PARQUET-1632:
--

Reopen this first.

 

I think the BytesInput returned from PlainValuesWriter.getBytes() already contains wrong information: the arrayOutput of the PlainValuesWriter has already overflowed.
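
A back-of-the-envelope check of that claim (the 4-byte length prefix per value is an assumption about the plain BYTE_ARRAY encoding, not something shown in the trace):
{code:java}
val perValue = 64L * 1024 * 1024 + 4   // 64 MiB payload + assumed 4-byte length prefix
val buffered = 40 * perValue           // 2,684,354,720 bytes buffered for the 40 rows
println(buffered > Int.MaxValue)       // true: a signed 32-bit counter cannot represent this
{code}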


[jira] [Commented] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-08 Thread Ivan Sadikov (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903082#comment-16903082 ]

Ivan Sadikov commented on PARQUET-1632:
---

I don't think your assessment is correct. Yes, it overflows when we cast size() to an integer, even though the size itself can be a long.

It looks like the problem is in ConcatenatingByteArrayCollector and the toByteArray method. The ConcatenatingByteArrayCollector class should handle an arbitrary BytesInput by checking its size first and splitting the data across slabs (in this particular case, even when the total size is larger than the i32 maximum value); I'd say that is the whole point of a concatenating byte array collector in the first place.

Note that the BytesInput instance itself did not fail with an overflow; it only failed when being turned into a byte array. If what you mentioned in the comment were true, I would expect the failure to occur when creating the BytesInput instance.

The collector should be filling up the last slab and allocating new slabs as it requests more and more data from the BytesInput, similar to an output stream.
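
A rough sketch of that idea (purely illustrative and not the parquet-mr API; the class and method names here are invented): stream the page bytes into bounded slabs so that no single allocation needs an int-sized length, even when the total exceeds Int.MaxValue.
{code:java}
import java.io.InputStream
import scala.collection.mutable.ArrayBuffer

// Hypothetical slab-based collector: the total collected size may exceed
// Int.MaxValue, but each individual allocation is at most slabSize bytes.
class SlabCollector(slabSize: Int = 64 * 1024 * 1024) {
  val slabs = ArrayBuffer.empty[Array[Byte]]

  def collect(in: InputStream): Unit = {
    var slab = new Array[Byte](slabSize)
    var filled = 0
    var read = in.read(slab, filled, slabSize - filled)
    while (read != -1) {
      filled += read
      if (filled == slabSize) {                 // slab is full: keep it, start a new one
        slabs += slab
        slab = new Array[Byte](slabSize)
        filled = 0
      }
      read = in.read(slab, filled, slabSize - filled)
    }
    if (filled > 0) slabs += slab.take(filled)  // keep the partially filled tail
  }
}
{code}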

 

Regarding your comment about the page size check: it is merely a mitigation strategy that gets past the error; it is not a fix. I already included it in PARQUET-1633.

My understanding is that parquet.page.size.row.check.min affects the page size itself. If the page size were larger than 2 GB, I would expect the code to fail here:

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L116]

Or here:

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L123]

Those conditions explicitly check the page size, but the code did not fail there.


[jira] [Commented] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-08 Thread Junjie Chen (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902820#comment-16902820 ]

Junjie Chen commented on PARQUET-1632:
--

The CapacityByteArrayOutputStream overflows because it ends up buffering 40 values of 64 MB each (about 2.5 GB in total), while the counters of CapacityByteArrayOutputStream are signed 32-bit integers that can address at most 2 GB (2^31 - 1 bytes). This happens because parquet.page.size.row.check.min defaults to 100, so the buffer is only checked once the row count reaches 100.

You can set parquet.page.size.row.check.min to a lower value, such as 20, to make sure the memory check runs before the counters overflow, so that the CapacityByteArrayOutputStream resets its buffer in time.
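
For example, mirroring the reproduction in the issue description (illustrative only; whether 20 is low enough depends on how large the individual values are):
{code:java}
// Run the page-size check more often so the page is flushed before the
// writer's int counters can overflow (the default is 100 rows between checks).
spark.conf.set("parquet.page.size.row.check.min", "20")
df.write.option("compression", "uncompressed").mode("overwrite").parquet("/tmp/large.parquet")
{code}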

 

 

 

 


[jira] [Commented] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-07 Thread Ivan Sadikov (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902098#comment-16902098 ]

Ivan Sadikov commented on PARQUET-1632:
---

Thank you.



[jira] [Commented] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-07 Thread Junjie Chen (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901892#comment-16901892 ]

Junjie Chen commented on PARQUET-1632:
--

I will take a look into this.




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)