taiyang-li opened a new issue, #8341:
URL: https://github.com/apache/incubator-gluten/issues/8341
### Description
In SplittableBzip2ReadBuffer, we assume that the maximum length of a single line does not exceed the working buffer size, which is currently fixed at 1 MB. However, we found a counterexample in production where a single line exceeds 40 MB. When this assumption does not hold, the task fails with an exception.
Query: d_13420.sql
```
2024/12/24 11:20:45.496 ERROR [Executor task launch worker for task 6076.0 in stage 1.0 (TID 6551)] spark.task.TaskResources: Task 6551 failed by error:
org.apache.gluten.exception.GlutenException: Can't find row delimiter in working buffer with size:1048576 While executing SubstraitFileSource
0. ../contrib/llvm-project/libcxx/include/exception:141: Poco::Exception::Exception(String const&, int) @ 0x0000000014dda999
1. ./build_new/../src/Common/Exception.cpp:105: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000006aeb599
2. ../src/Common/Exception.h:105: DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000699df8c
3. ../src/Common/Exception.h:123: DB::Exception::Exception<unsigned long&>(int, FormatStringHelperImpl<std::type_identity<unsigned long&>::type>, unsigned long&) @ 0x0000000006acaf2b
4. ./build_new/../utils/extern-local-engine/IO/SplittableBzip2ReadBuffer.cpp:298: DB::SplittableBzip2ReadBuffer::nextImpl() @ 0x0000000007183704
5. ../src/IO/ReadBuffer.h:67: DB::JSONEachRowRowInputFormat::readRow(std::vector<COW<DB::IColumn>::mutable_ptr<DB::IColumn>, std::allocator<COW<DB::IColumn>::mutable_ptr<DB::IColumn>>>&, DB::RowReadExtension&) @ 0x0000000011dd10d5
6. ./build_new/../src/Processors/Formats/IRowInputFormat.cpp:143: DB::IRowInputFormat::read() @ 0x0000000011d8d891
7. ./build_new/../src/Processors/Formats/IInputFormat.cpp:19: DB::IInputFormat::generate() @ 0x0000000011d891d6
8. ./build_new/../utils/extern-local-engine/Storages/SubstraitSource/SubstraitFileSource.cpp:377: local_engine::NormalFileReader::pull(DB::Chunk&) @ 0x0000000007227061
9. ./build_new/../utils/extern-local-engine/Storages/SubstraitSource/SubstraitFileSource.cpp:114: local_engine::SubstraitFileSource::generate() @ 0x0000000007224c0b
10. ./build_new/../src/Processors/ISource.cpp:139: DB::ISource::tryGenerate() @ 0x0000000011d65a77
11. ./build_new/../src/Processors/ISource.cpp:108: DB::ISource::work() @ 0x0000000011d65845
12. ./build_new/../src/Processors/Executors/ExecutionThreadContext.cpp:49: DB::ExecutionThreadContext::executeTask() @ 0x0000000011d7db22
13. ./build_new/../src/Processors/Executors/PipelineExecutor.cpp:290: DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x0000000011d72b9f
14. ./build_new/../src/Processors/Executors/PipelineExecutor.cpp:256: DB::PipelineExecutor::executeImpl(unsigned long, bool) @ 0x0000000011d7235d
15. ./build_new/../src/Processors/Executors/PipelineExecutor.cpp:127: DB::PipelineExecutor::execute(unsigned long, bool) @ 0x0000000011d7214b
16. ./build_new/../utils/extern-local-engine/Parser/LocalExecutor.cpp:130: local_engine::LocalExecutor::execute() @ 0x0000000006eaf5c1
17. ./build_new/../utils/extern-local-engine/local_engine_jni.cpp:606: buildAndExecuteShuffle @ 0x0000000006989ec4
18. ./build_new/../utils/extern-local-engine/local_engine_jni.cpp:726: Java_org_apache_gluten_vectorized_CHShuffleSplitterJniWrapper_nativeMakeForRSS @ 0x000000000698bb7c
	at org.apache.gluten.vectorized.CHShuffleSplitterJniWrapper.nativeMakeForRSS(Native Method)
	at org.apache.gluten.vectorized.CHShuffleSplitterJniWrapper.makeForRSS(CHShuffleSplitterJniWrapper.java:73)
	at org.apache.spark.shuffle.CHCelebornColumnarShuffleWriter.internalWrite(CHCelebornColumnarShuffleWriter.scala:87)
	at org.apache.spark.shuffle.CelebornColumnarShuffleWriter.write(CelebornColumnarShuffleWriter.scala:119)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]