Steve Loughran created HADOOP-18830:
---------------------------------------
Summary: S3 Select: deprecate vs cut
Key: HADOOP-18830
URL: https://issues.apache.org/jira/browse/HADOOP-18830
Project: Hadoop Common
Issue Type: Sub-task
Components: fs/s3
Affects Versions: 3.4.0
Reporter: Steve Loughran
Getting S3 Select to work with the v2 SDK is tricky: we need to add extra
libraries to the classpath beyond just bundle.jar. We can do this, but:
* AFAIK nobody has ever done CSV predicate pushdown, as it breaks split logic
completely
* CSV is a bad format
* one-line JSON is more structured but also way less efficient
ORC/Parquet, by contrast, benefit from vectored IO and from work that can be
split across the cluster.
Accordingly, I'm wondering what to do about S3 Select:
# cut?
# downgrade to optional and document the extra classes on the classpath
Option #2 is straightforward and effectively the default. We can also declare
the feature deprecated.
{code}
[ERROR] testReadLandsatRecordsNoMatch(org.apache.hadoop.fs.s3a.select.ITestS3SelectLandsat)  Time elapsed: 147.958 s  <<< ERROR!
java.io.IOException: java.lang.NoClassDefFoundError: software/amazon/eventstream/MessageDecoder
    at org.apache.hadoop.fs.s3a.select.SelectObjectContentHelper.select(SelectObjectContentHelper.java:75)
    at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$select$10(WriteOperationHelper.java:660)
    at org.apache.hadoop.fs.store.audit.AuditingFunctions.lambda$withinAuditSpan$0(AuditingFunctions.java:62)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
{code}
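If option 2 is taken, one way to make the failure less cryptic than the NoClassDefFoundError above would be to probe for the optional class before issuing a select call. The following is only a minimal sketch under that assumption: SelectClasspathProbe and its methods are hypothetical names, not existing S3A code, and the only class name taken from this issue is software.amazon.eventstream.MessageDecoder. The full set of extra classes needed is not listed here.

```java
/**
 * Hypothetical helper (not part of Hadoop): checks whether a class that
 * S3 Select needs beyond bundle.jar is present on the classpath.
 */
final class SelectClasspathProbe {

  /** Class named in the stack trace above, as a binary name (dots, not slashes). */
  static final String EVENTSTREAM_DECODER =
      "software.amazon.eventstream.MessageDecoder";

  private SelectClasspathProbe() {
  }

  /** Returns true if the class can be located; it is loaded but not initialized. */
  static boolean isClassAvailable(String className) {
    try {
      Class.forName(className, false,
          SelectClasspathProbe.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      // Missing class, or a class whose own dependencies are missing.
      return false;
    }
  }

  /** Builds a one-line diagnostic suitable for a log or exception message. */
  static String describe(String className) {
    return isClassAvailable(className)
        ? className + " found on classpath"
        : className + " missing: S3 Select needs extra libraries on the classpath";
  }

  public static void main(String[] args) {
    System.out.println(describe(EVENTSTREAM_DECODER));
  }
}
```

A caller could run such a check once and raise a descriptive error pointing at the documentation for the extra classpath entries, instead of surfacing the wrapped IOException seen in the test run.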