[
https://issues.apache.org/jira/browse/PHOENIX-763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071634#comment-14071634
]
Gabriel Reid commented on PHOENIX-763:
--------------------------------------
I'm not super-familiar with Sqoop, but I've taken a look through the code.
One initial thing that I think should be considered is what counts as Sqoop
integration. I think Sqoop is typically intended to get data from a
JDBC-accessible structured data repository into Hadoop (or HBase), as well as
going in the reverse direction. I think the general idea of Sqoop is to allow
getting data from a RDBMS into Hadoop to allow some kind of batch processing,
and then get the batch-processing results back into the RDBMS.
Phoenix is a bit of a special case because it's a JDBC-accessible structured
data repository, and it's also Hadoop/HBase.
From what I understand, this patch treats Phoenix as the Hadoop/HBase side of
the equation, meaning it allows moving data from a JDBC database into Phoenix.
Another way to go would be to treat Phoenix as the JDBC side of the equation,
which would mean there would be a way to get data out of Phoenix and then back
in. I think this second approach is something that should be theoretically
possible without any code changes in Phoenix (although that remains to be
seen).
I think both use cases (use Phoenix as a JDBC store or use it as a Hadoop/HBase
store) are interesting, although my initial assumption was that the idea of
this ticket was to use Phoenix as a JDBC store within Sqoop.
Going through the code itself, I had the following notes:
* PhoenixSqoopRecordWriter#write is using the toString of input values in
order to encode them for the upserts. I'm thinking that this will probably have
issues on things like dates, arrays, and binary values.
* PhoenixImportTest should be called PhoenixImportIT and should be in a source
directory called "it", in the same way as other modules have their integration
tests set up. This ensures that slow-running integration tests are only run
when "mvn verify" is run, and more importantly keeps the runtime of "mvn test"
down to a minimum.
* Minor nit: I noticed that Throwables.propagate is being used in a few "catch
(SQLException e)" blocks. Throwables.propagate is normally intended for "catch
(Exception e)" blocks, where you're not sure if the underlying exception is a
checked exception or not. It would be equivalent to just do "throw new
RuntimeException(e)"
* I would personally be in favor of using a normal hadoop Configuration object
instead of the sqoop-specific PhoenixConfiguration class as a parameter in the
constructor to PhoenixSqoopRecordWriter. I think a more commonly-used pattern
for something like this is to use the PhoenixConfiguration (or maybe it should
be PhoenixSqoopConfiguration) as a util to set and retrieve values from the
Configuration object, but not replace it entirely
* It looks like the com.cloudera.sqoop classes are used in most spots, instead
of the org.apache.sqoop classes. I think that the general idea is to use the
org.apache.sqoop classes, with the com.cloudera.sqoop classes still being
present only for backwards compatibility
* normalizeColumnNames() is being called in every call to
PhoenixSqoopRecordWriter#write, which seems like it might be a bit inefficient.
If possible, it would probably be better to just do the name conversion once
* it might be safer to re-use SchemaUtil#normalizeIdentifier instead of
PhoenixSqoopRecordWriter#normalizeColumnNames, as I think normalizeIdentifier
handles some special cases (such as quoted column names)
* I was a bit confused about what was going on in PhoenixManagerFactory#accept.
I think it would be a good idea to make it more clear exactly why the
PhoenixManagerFactory class needs to exist and what it's doing
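To make the toString concern above concrete, here's a minimal, self-contained sketch (hypothetical values, not code from the patch) showing why rendering values as strings doesn't round-trip for binary data; binding the original object, as PreparedStatement#setObject would, avoids this:

```java
// Demonstrates the toString pitfall: Object#toString on a byte[] yields
// an identity string like "[B@1b6d3586", not the byte contents, so
// encoding upsert values via toString would corrupt binary columns.
public class ToStringPitfall {
    public static void main(String[] args) {
        byte[] binary = {0x01, 0x02, 0x03};
        String rendered = binary.toString();
        // The rendered form carries no byte data at all.
        System.out.println(rendered.startsWith("[B@")); // true
    }
}
```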
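On the Throwables.propagate nit above, a quick sketch of the equivalence: when the caught type is known to be checked, Guava's helper adds nothing over the plain JDK form:

```java
import java.sql.SQLException;

// Inside "catch (SQLException e)" the checked/unchecked question is already
// answered, so Throwables.propagate(e) is equivalent to the plain wrap below
// (and drops the Guava dependency for this code path).
public class PropagateEquivalent {
    static void doWork() {
        try {
            throw new SQLException("boom");
        } catch (SQLException e) {
            // Equivalent to Throwables.propagate(e) for a checked exception:
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        try {
            doWork();
        } catch (RuntimeException e) {
            System.out.println(e.getCause() instanceof SQLException); // true
        }
    }
}
```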
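The configuration point above is roughly this pattern: static helpers that read and write keys on the shared Configuration, rather than a subclass that replaces it. A minimal sketch, where PhoenixConfigUtil and the key name are hypothetical and java.util.Properties stands in for Hadoop's Configuration so the example is self-contained:

```java
import java.util.Properties;

// Util-over-wrapper sketch: helpers set/get Phoenix-specific keys on the
// plain configuration object instead of subclassing it.
public class PhoenixConfigUtil {
    // Assumed key name, for illustration only.
    private static final String TABLE_KEY = "phoenix.sqoop.table";

    public static void setTableName(Properties conf, String table) {
        conf.setProperty(TABLE_KEY, table);
    }

    public static String getTableName(Properties conf) {
        return conf.getProperty(TABLE_KEY);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        setTableName(conf, "MY_TABLE");
        // The writer constructor could then take the plain conf object.
        System.out.println(getTableName(conf)); // MY_TABLE
    }
}
```

The payoff is that PhoenixSqoopRecordWriter can accept the ordinary Configuration everyone else passes around, with Phoenix-specific accessors layered on top.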
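And the normalize-once suggestion above might look like the following sketch: do the column-name conversion in the constructor and reuse the result on every write. toUpperCase is only a stand-in here for SchemaUtil#normalizeIdentifier (which also handles quoted identifiers), and the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Caches normalized column names at construction time so write() does no
// per-record name conversion.
public class CachedNormalizer {
    private final List<String> normalizedColumns = new ArrayList<>();

    CachedNormalizer(List<String> rawColumns) {
        for (String c : rawColumns) {
            // Stand-in for SchemaUtil#normalizeIdentifier.
            normalizedColumns.add(c.toUpperCase());
        }
    }

    // write() would consult this precomputed list instead of renormalizing.
    List<String> columns() {
        return normalizedColumns;
    }

    public static void main(String[] args) {
        CachedNormalizer n = new CachedNormalizer(List.of("id", "name"));
        System.out.println(n.columns()); // [ID, NAME]
    }
}
```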
One additional meta note is a bit wider in scope than this one ticket, but it's
something to be considered. There are currently a CSV import tool and Pig
integration that both handle bulk writing to Phoenix, and this patch adds a
third implementation of that (basically mapping input records into Phoenix
prepared statements). It would be good to get these three implementations
aligned, although that doesn't necessarily have to happen as part of this ticket.
> Support for Sqoop
> -----------------
>
> Key: PHOENIX-763
> URL: https://issues.apache.org/jira/browse/PHOENIX-763
> Project: Phoenix
> Issue Type: Task
> Affects Versions: 3.0.0
> Reporter: James Taylor
> Assignee: mravi
> Labels: patch
> Attachments: PHOENIX-763.patch
>
>
> Not sure anything required from our end, but you should be able to use Sqoop
> to create and populate Phoenix tables.
--
This message was sent by Atlassian JIRA
(v6.2#6252)