taiyang-li opened a new issue, #8341:
URL: https://github.com/apache/incubator-gluten/issues/8341
### Description
In SplittableBzip2ReadBuffer, we assume that the maximum length of a single line does not exceed the working buffer size, which is currently fixed at 1 MB. However, we found a counterexample in production where a single line exceeds 40 MB. When this assumption does not hold, the task fails with an exception.
Query: d_13420.sql
```
2024/12/24 11:20:45.496 ERROR [Executor task launch worker for task 6076.0 in stage 1.0 (TID 6551)] spark.task.TaskResources: Task 6551 failed by error:
org.apache.gluten.exception.GlutenException: Can't find row delimiter in working buffer with size:1048576 While executing SubstraitFileSource
0. ../contrib/llvm-project/libcxx/include/exception:141: Poco::Exception::Exception(String const&, int) @ 0x0000000014dda999
1. ./build_new/../src/Common/Exception.cpp:105: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000006aeb599
2. ../src/Common/Exception.h:105: DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000699df8c
3. ../src/Common/Exception.h:123: DB::Exception::Exception<unsigned long&>(int, FormatStringHelperImpl<std::type_identity<unsigned long&>::type>, unsigned long&) @ 0x0000000006acaf2b
4. ./build_new/../utils/extern-local-engine/IO/SplittableBzip2ReadBuffer.cpp:298: DB::SplittableBzip2ReadBuffer::nextImpl() @ 0x0000000007183704
5. ../src/IO/ReadBuffer.h:67: DB::JSONEachRowRowInputFormat::readRow(std::vector<COW<DB::IColumn>::mutable_ptr<DB::IColumn>, std::allocator<COW<DB::IColumn>::mutable_ptr<DB::IColumn>>>&, DB::RowReadExtension&) @ 0x0000000011dd10d5
6. ./build_new/../src/Processors/Formats/IRowInputFormat.cpp:143: DB::IRowInputFormat::read() @ 0x0000000011d8d891
7. ./build_new/../src/Processors/Formats/IInputFormat.cpp:19: DB::IInputFormat::generate() @ 0x0000000011d891d6
8. ./build_new/../utils/extern-local-engine/Storages/SubstraitSource/SubstraitFileSource.cpp:377: local_engine::NormalFileReader::pull(DB::Chunk&) @ 0x0000000007227061
9. ./build_new/../utils/extern-local-engine/Storages/SubstraitSource/SubstraitFileSource.cpp:114: local_engine::SubstraitFileSource::generate() @ 0x0000000007224c0b
10. ./build_new/../src/Processors/ISource.cpp:139: DB::ISource::tryGenerate() @ 0x0000000011d65a77
11. ./build_new/../src/Processors/ISource.cpp:108: DB::ISource::work() @ 0x0000000011d65845
12. ./build_new/../src/Processors/Executors/ExecutionThreadContext.cpp:49: DB::ExecutionThreadContext::executeTask() @ 0x0000000011d7db22
13. ./build_new/../src/Processors/Executors/PipelineExecutor.cpp:290: DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x0000000011d72b9f
14. ./build_new/../src/Processors/Executors/PipelineExecutor.cpp:256: DB::PipelineExecutor::executeImpl(unsigned long, bool) @ 0x0000000011d7235d
15. ./build_new/../src/Processors/Executors/PipelineExecutor.cpp:127: DB::PipelineExecutor::execute(unsigned long, bool) @ 0x0000000011d7214b
16. ./build_new/../utils/extern-local-engine/Parser/LocalExecutor.cpp:130: local_engine::LocalExecutor::execute() @ 0x0000000006eaf5c1
17. ./build_new/../utils/extern-local-engine/local_engine_jni.cpp:606: buildAndExecuteShuffle @ 0x0000000006989ec4
18. ./build_new/../utils/extern-local-engine/local_engine_jni.cpp:726: Java_org_apache_gluten_vectorized_CHShuffleSplitterJniWrapper_nativeMakeForRSS @ 0x000000000698bb7c
	at org.apache.gluten.vectorized.CHShuffleSplitterJniWrapper.nativeMakeForRSS(Native Method)
	at org.apache.gluten.vectorized.CHShuffleSplitterJniWrapper.makeForRSS(CHShuffleSplitterJniWrapper.java:73)
	at org.apache.spark.shuffle.CHCelebornColumnarShuffleWriter.internalWrite(CHCelebornColumnarShuffleWriter.scala:87)
	at org.apache.spark.shuffle.CelebornColumnarShuffleWriter.write(CelebornColumnarShuffleWriter.scala:119)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]