Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10165 )

Change subject: IMPALA-6941: load more text scanner compression plugins
......................................................................


Patch Set 6:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-plugin-text-scanner.h
File be/src/exec/hdfs-plugin-text-scanner.h:

http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-plugin-text-scanner.h@63
PS6, Line 63: HdfsScanner* (*create_scanner_fn)
            :         (HdfsScanNodeBase* scan_node, RuntimeState* state)
> Maybe you could create a typedef for the function pointers, e.g.:
Done


http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-plugin-text-scanner.h@67
PS6, Line 67: Status (*issue_initial_ranges_fn)(
            :         HdfsScanNodeBase* scan_node, const 
std::vector<HdfsFileDesc*>& files)
> same as above
Done


http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-plugin-text-scanner.h@73
PS6, Line 73: SpinLock
> Wouldn't be better to use some read/write lock like boost::shared_mutex?
Yeah I think that makes sense. I was thinking that the critical section is 
probably short enough that it doesn't matter in practice but we don't have to 
reason about that at all with the shared_mutex.


http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-plugin-text-scanner.cc
File be/src/exec/hdfs-plugin-text-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-plugin-text-scanner.cc@42
PS6, Line 42: "(Advanced) whitelist of HDFS "
            :     "text scanner plugins that Impala will try to dynamically 
load."
> I think you should mention that it is comma-separated, or give an example.
I briefly summarized what the plugin does (versus the builtin text parsing). I 
think that addresses the question but lmk if it doesn't.


http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-plugin-text-scanner.cc@56
PS6, Line 56: HdfsScanner* (*create_scanner_fn) (HdfsScanNodeBase* scan_node, 
RuntimeState* state)
> You could use the typedef here.
Done


http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-plugin-text-scanner.cc@71
PS6, Line 71: Status (*issue_initial_ranges_fn)(
            :       HdfsScanNodeBase* scan_node, const 
std::vector<HdfsFileDesc*>& files)
> and here
Done


http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-plugin-text-scanner.cc@123
PS6, Line 123:       "GetImpalaBuildVersion", 
reinterpret_cast<void**>(&get_plugin_impala_build_version)));
> nit: too long line
Done


http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-plugin-text-scanner.cc@126
PS6, Line 126:     stringstream ss;
             :     ss << "Scanner plugin " << plugin_name << " was built 
against Impala version "
             :        << get_plugin_impala_build_version() << ", but the 
running Impala version is "
             :        << GetDaemonBuildVersion();
> nit: I think it's a little inconsistent that we use stringstream here and S
Done. Yeah I got lazy about converting this :)


http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-scan-node-base.cc@643
PS6, Line 643: =_
> nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/10165/6/be/src/exec/hdfs-text-scanner.cc@129
PS6, Line 129: =_
> nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/10165/6/fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
File fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java:

http://gerrit.cloudera.org:8080/#/c/10165/6/fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java@41
PS6, Line 41:  LZO,
            :   LZO_INDEX, //Lzo index file.
            :   LZ4,
            :   ZSTD;
> should we make these (and e.g. file extension) be runtime parameters as wel
This and related enums (THdfsCompression, the flatbuffer version) are used in a 
lot of places, so it would probably require a lot of changes. I don't think 
it's inherently hard but it would take a while to verify that the fix in each 
place is correct.

At least now we have all the built-in Hadoop decompression codecs: 
https://github.com/apache/hadoop/tree/a0a276162147e843a5a4e028abdca5b66f5118da/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress
 so we could just extend this list when a codec is added to Hadoop (which seems 
to happen at most once a year).


http://gerrit.cloudera.org:8080/#/c/10165/6/tests/metadata/test_partition_metadata.py
File tests/metadata/test_partition_metadata.py:

http://gerrit.cloudera.org:8080/#/c/10165/6/tests/metadata/test_partition_metadata.py@168
PS6, Line 168: #unique_database = 'test_db'
> nit: remove this line
Done



--
To view, visit http://gerrit.cloudera.org:8080/10165
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2a9c4a4a11bed81df706e9e834400bfedfe48e6
Gerrit-Change-Number: 10165
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 02 May 2018 22:44:26 +0000
Gerrit-HasComments: Yes

Reply via email to