[jira] [Commented] (IMPALA-10005) Impala can't read Snappy compressed text files on S3 or ABFS

ASF subversion and git services (Jira) Thu, 06 Aug 2020 13:13:13 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172649#comment-17172649
 ]


ASF subversion and git services commented on IMPALA-10005:
----------------------------------------------------------

Commit dbbd40308a6d1cef77bfe45e016e775c918e0539 in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=dbbd403 ]

IMPALA-10005: Fix Snappy decompression for non-block filesystems

Snappy-compressed text always uses THdfsCompression::SNAPPY_BLOCKED
type compression in the backend. However, for non-block filesystems,
the frontend is incorrectly passing THdfsCompression::SNAPPY instead.
On debug builds, this leads to a DCHECK when trying to read
Snappy-compressed text. On release builds, it fails to decompress
the data.

This fixes the frontend to always pass THdfsCompression::SNAPPY_BLOCKED
for Snappy-compressed text.
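
As a rough illustration of the rule described above, the following minimal
sketch normalizes the compression type before it is sent to the backend. The
class, enums, and method names are hypothetical stand-ins invented for this
example. They are not Impala's actual frontend code or its Thrift types, and
the real change may be structured quite differently.

```java
// Hypothetical sketch; names are illustrative, not Impala's real classes.
class SnappyTextCompressionSketch {
  // Stand-in for the relevant THdfsCompression values.
  enum Compression { NONE, GZIP, SNAPPY, SNAPPY_BLOCKED }

  // Stand-in for the file formats this sketch distinguishes.
  enum FileFormat { TEXT, OTHER }

  // Returns the compression type to pass to the backend. Snappy-compressed
  // text is always reported as SNAPPY_BLOCKED, regardless of which code path
  // (block or non-block filesystem) produced the scan range. Every other
  // combination is passed through unchanged in this sketch.
  static Compression toBackendCompression(FileFormat format, Compression detected) {
    if (format == FileFormat.TEXT && detected == Compression.SNAPPY) {
      return Compression.SNAPPY_BLOCKED;
    }
    return detected;
  }

  public static void main(String[] args) {
    System.out.println(toBackendCompression(FileFormat.TEXT, Compression.SNAPPY));
    System.out.println(toBackendCompression(FileFormat.TEXT, Compression.GZIP));
  }
}
```

Applying the normalization in one shared place, rather than separately in each
path that generates scan ranges, is one way to keep the block and non-block
filesystem paths from diverging again.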

This reworks query_test/test_compressed_formats.py to provide better
coverage:
 - Changed the RC and Seq test cases to verify that the file extension
   doesn't matter. Added Avro to this case as well.
 - Fixed the text case to use appropriate extensions (fixing IMPALA-9004)
 - Changed the utility function so it doesn't use Hive. This allows it
   to be enabled on non-HDFS filesystems like S3.
 - Changed the test to use unique_database and allow parallel execution.
 - Changed the test to run in the core job, so it now has coverage on
   the usual S3 test configuration. It is reasonably quick (1-2 minutes)
   and runs in parallel.

Testing:
 - Exhaustive job
 - Core s3 job
 - Changed the frontend to force it to use the code for non-block
   filesystems (i.e. the TFileSplitGeneratorSpec code) and
   verified that it is now able to read Snappy-compressed text.

Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Reviewed-on: http://gerrit.cloudera.org:8080/16278
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Sahil Takiar <[email protected]>


> Impala can't read Snappy compressed text files on S3 or ABFS
> ------------------------------------------------------------
>
>                 Key: IMPALA-10005
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10005
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 4.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Blocker
>
> When reading snappy compressed text from S3 or ABFS on a release build, it 
> fails to decompress:
>  
> {noformat}
> I0723 21:19:43.712909 229706 status.cc:128] Snappy: RawUncompress failed
>     @           0xae26c9  impala::Status::Status()
>     @          0x107635b  impala::SnappyDecompressor::ProcessBlock()
>     @          0x11b1f2d  impala::HdfsTextScanner::FillByteBufferCompressedFile()
>     @          0x11b23ef  impala::HdfsTextScanner::FillByteBuffer()
>     @          0x11af96f  impala::HdfsTextScanner::FillByteBufferWrapper()
>     @          0x11b096b  impala::HdfsTextScanner::ProcessRange()
>     @          0x11b2b31  impala::HdfsTextScanner::GetNextInternal()
>     @          0x118644b  impala::HdfsScanner::ProcessSplit()
>     @          0x11774c2  impala::HdfsScanNode::ProcessSplit()
>     @          0x1178805  impala::HdfsScanNode::ScannerThread()
>     @          0x1100f31  impala::Thread::SuperviseThread()
>     @          0x1101a79  boost::detail::thread_data<>::run()
>     @          0x16a3449  thread_proxy
>     @     0x7fc522befe24  start_thread
>     @     0x7fc522919bac  __clone{noformat}
> When using a debug build, Impala hits the following DCHECK:
>  
>  
> {noformat}
> F0723 23:45:12.849973 249653 hdfs-text-scanner.cc:197] Check failed: 
> stream_->file_desc()->file_compression != THdfsCompression::SNAPPY FE should 
> have generated SNAPPY_BLOCKED instead.{noformat}
> That DCHECK explains why it would fail to decompress. It is using the wrong 
> THdfsCompression.
> I reproduced this on master in my dev env by changing 
> FileSystemUtil::supportsStorageIds() to always return false. This emulates the 
> behavior on object stores like S3 and ABFS.
>  
> {noformat}
>   /**
>    * Returns true if the filesystem supports storage UUIDs in BlockLocation calls.
>    */
>   public static boolean supportsStorageIds(FileSystem fs) {
>     return false;
>   }{noformat}
> This is specific to Snappy and does not appear to apply to other compression 
> codecs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

