[
https://issues.apache.org/jira/browse/PHOENIX-763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071634#comment-14071634
]
Gabriel Reid commented on PHOENIX-763:
--------------------------------------
I'm not super-familiar with Sqoop, but I've taken a look through the code.
One initial thing that I think should be considered is what counts as Sqoop
integration. I think Sqoop is typically intended to get data from a
JDBC-accessible structured data repository into Hadoop (or HBase), as well as
going in the reverse direction. I think the general idea of Sqoop is to allow
getting data from a RDBMS into Hadoop to allow some kind of batch processing,
and then get the batch-processing results back into the RDBMS.
Phoenix is a bit of a special case because it's a JDBC-accessible structured
data repository, and it's also Hadoop/HBase.
From what I understand, this patch treats Phoenix as the Hadoop/HBase side of
the equation, meaning it allows moving data from a JDBC database into Phoenix.
Another way to go would be to treat Phoenix as the JDBC side of the equation,
which would mean there would be a way to get data out of Phoenix and then back
in. I think this second approach is something that should be theoretically
possible without any code changes in Phoenix (although that remains to be
seen).
I think both use cases (use Phoenix as a JDBC store or use it as a Hadoop/HBase
store) are interesting, although my initial assumption was that the idea of
this ticket was to use Phoenix as a JDBC store within Sqoop.
Going through the code itself, I had the following notes:
* PhoenixSqoopRecordWriter#write is using the toString of input values in
order to encode them for the upserts. I'm thinking that this will probably have
issues on things like dates, arrays, and binary values.
* PhoenixImportTest should be called PhoenixImportIT and should be in a source
directory called "it", in the same way as other modules have their integration
tests set up. This ensures that slow-running integration tests are only run
when "mvn verify" is run, and more importantly keeps the runtime of "mvn test"
down to a minimum.
* Minor nit: I noticed that Throwables.propagate is being used in a few "catch
(SQLException e)" blocks. Throwables.propagate is normally intended for "catch
(Exception e)" blocks, where you're not sure if the underlying exception is a
checked exception or not. It would be equivalent to just do "throw new
RuntimeException(e)"
* I would personally be in favor of using a normal hadoop Configuration object
instead of the sqoop-specific PhoenixConfiguration class as a parameter in the
constructor to PhoenixSqoopRecordWriter. I think a more commonly-used pattern
for something like this is to use the PhoenixConfiguration (or maybe it should
be PhoenixSqoopConfiguration) as a util to set and retrieve values from the
Configuration object, but not replace it entirely
* It looks like the com.cloudera.sqoop classes are used in most spots, instead
of the org.apache.sqoop classes. I think that the general idea is to use the
org.apache.sqoop classes, with the com.cloudera.sqoop classes still being
present only for backwards compatibility
* normalizeColumnNames() is being called in every call to
PhoenixSqoopRecordWriter#write, which seems like it might be a bit inefficient.
If possible, it would probably be better to just do the name conversion once
* it might be safer to re-use SchemaUtil#normalizeIdentifier instead of
PhoenixSqoopRecordWriter#normalizeColumnNames, as I think normalizeIdentifier
handles some special cases (such as quoted column names)
* I was a bit confused about what was going on in PhoenixManagerFactory#accept.
I think it would be a good idea to make it more clear exactly why the
PhoenixManagerFactory class needs to exist and what it's doing
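To make the toString concern above concrete, here's a minimal, self-contained sketch (hypothetical values, not code from the patch) showing why rendering values as strings doesn't round-trip for binary data; binding the original object, as PreparedStatement#setObject would, avoids this:

```java
// Demonstrates the toString pitfall: Object#toString on a byte[] yields
// an identity string like "[B@1b6d3586", not the byte contents, so
// encoding upsert values via toString would corrupt binary columns.
public class ToStringPitfall {
    public static void main(String[] args) {
        byte[] binary = {0x01, 0x02, 0x03};
        String rendered = binary.toString();
        // The rendered form carries no byte data at all.
        System.out.println(rendered.startsWith("[B@")); // true
    }
}
```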
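On the Throwables.propagate nit above, a quick sketch of the equivalence: when the caught type is known to be checked, Guava's helper adds nothing over the plain JDK form:

```java
import java.sql.SQLException;

// Inside "catch (SQLException e)" the checked/unchecked question is already
// answered, so Throwables.propagate(e) is equivalent to the plain wrap below
// (and drops the Guava dependency for this code path).
public class PropagateEquivalent {
    static void doWork() {
        try {
            throw new SQLException("boom");
        } catch (SQLException e) {
            // Equivalent to Throwables.propagate(e) for a checked exception:
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        try {
            doWork();
        } catch (RuntimeException e) {
            System.out.println(e.getCause() instanceof SQLException); // true
        }
    }
}
```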
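The configuration point above is roughly this pattern: static helpers that read and write keys on the shared Configuration, rather than a subclass that replaces it. A minimal sketch, where PhoenixConfigUtil and the key name are hypothetical and java.util.Properties stands in for Hadoop's Configuration so the example is self-contained:

```java
import java.util.Properties;

// Util-over-wrapper sketch: helpers set/get Phoenix-specific keys on the
// plain configuration object instead of subclassing it.
public class PhoenixConfigUtil {
    // Assumed key name, for illustration only.
    private static final String TABLE_KEY = "phoenix.sqoop.table";

    public static void setTableName(Properties conf, String table) {
        conf.setProperty(TABLE_KEY, table);
    }

    public static String getTableName(Properties conf) {
        return conf.getProperty(TABLE_KEY);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        setTableName(conf, "MY_TABLE");
        // The writer constructor could then take the plain conf object.
        System.out.println(getTableName(conf)); // MY_TABLE
    }
}
```

The payoff is that PhoenixSqoopRecordWriter can accept the ordinary Configuration everyone else passes around, with Phoenix-specific accessors layered on top.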
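And the normalize-once suggestion above might look like the following sketch: do the column-name conversion in the constructor and reuse the result on every write. toUpperCase is only a stand-in here for SchemaUtil#normalizeIdentifier (which also handles quoted identifiers), and the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Caches normalized column names at construction time so write() does no
// per-record name conversion.
public class CachedNormalizer {
    private final List<String> normalizedColumns = new ArrayList<>();

    CachedNormalizer(List<String> rawColumns) {
        for (String c : rawColumns) {
            // Stand-in for SchemaUtil#normalizeIdentifier.
            normalizedColumns.add(c.toUpperCase());
        }
    }

    // write() would consult this precomputed list instead of renormalizing.
    List<String> columns() {
        return normalizedColumns;
    }

    public static void main(String[] args) {
        CachedNormalizer n = new CachedNormalizer(List.of("id", "name"));
        System.out.println(n.columns()); // [ID, NAME]
    }
}
```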
One additional meta note is a bit wider in scope than this one ticket, but it's
something to be considered. There are currently a CSV import tool and Pig
integration that both handle bulk writing to Phoenix, and this patch adds a
third implementation of that (basically mapping input records into Phoenix
prepared statements). It would be good to get these three implementations
aligned, although that doesn't necessarily have to happen as part of this ticket.
> Support for Sqoop
> -----------------
>
> Key: PHOENIX-763
> URL: https://issues.apache.org/jira/browse/PHOENIX-763
> Project: Phoenix
> Issue Type: Task
> Affects Versions: 3.0.0
> Reporter: James Taylor
> Assignee: mravi
> Labels: patch
> Attachments: PHOENIX-763.patch
>
>
> Not sure anything required from our end, but you should be able to use Sqoop
> to create and populate Phoenix tables.
--
This message was sent by Atlassian JIRA
(v6.2#6252)