[jira] [Commented] (PHOENIX-2632) Easier Hive->Phoenix data movement

2016-01-28 Thread Josh Mahonin (JIRA)

[ https://issues.apache.org/jira/browse/PHOENIX-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122106#comment-15122106 ]

Josh Mahonin commented on PHOENIX-2632:
---

Whoops, attaching previous link I forgot:

[1] 
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DefaultSource.scala#L40-L44

> Easier Hive->Phoenix data movement
> --
>
> Key: PHOENIX-2632
> URL: https://issues.apache.org/jira/browse/PHOENIX-2632
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Randy Gelhausen
>
> Moving tables or query results from Hive into Phoenix today requires 
> error-prone manual schema re-definition inside HBase storage handler 
> properties. Since Hive and Phoenix support nearly equivalent types, it should 
> be easier for users to pick a Hive table (or query results derived from it) 
> and load it into Phoenix.
> I'm posting this to open design discussion, but also submit my own project 
> https://github.com/randerzander/HiveToPhoenix for consideration as an early 
> solution. It creates a Spark DataFrame from a Hive query, uses Phoenix JDBC 
> to "create if not exists" a Phoenix equivalent table, and uses the 
> phoenix-spark artifact to store the DataFrame into Phoenix.
> I'm eager to get feedback if this is interesting/useful to the Phoenix 
> community.
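The flow quoted above (create a DataFrame from a Hive query, "create if not exists" a matching Phoenix table over JDBC, then save via phoenix-spark) hinges on translating a Hive schema into Phoenix DDL. Here is a minimal sketch of that translation step; the type mapping and helper names are illustrative assumptions, not code from HiveToPhoenix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: translate a Hive-style schema into a Phoenix
// "create if not exists" statement. The type mapping below is an
// approximation of the near-equivalent types; the real tool's mapping
// may differ.
public class HiveToPhoenixDdl {

    // A few common Hive -> Phoenix type equivalences (assumed, not exhaustive).
    static String phoenixType(String hiveType) {
        switch (hiveType.toUpperCase()) {
            case "STRING":  return "VARCHAR";
            case "INT":     return "INTEGER";
            case "BIGINT":  return "BIGINT";
            case "DOUBLE":  return "DOUBLE";
            case "BOOLEAN": return "BOOLEAN";
            default: throw new IllegalArgumentException("Unmapped type: " + hiveType);
        }
    }

    // Build the DDL, marking pkColumn as the primary key.
    static String createIfNotExists(String table, Map<String, String> hiveCols,
                                    String pkColumn) {
        StringBuilder ddl =
            new StringBuilder("CREATE TABLE IF NOT EXISTS " + table + " (");
        for (Map.Entry<String, String> col : hiveCols.entrySet()) {
            ddl.append(col.getKey()).append(' ').append(phoenixType(col.getValue()));
            if (col.getKey().equals(pkColumn)) ddl.append(" PRIMARY KEY");
            ddl.append(", ");
        }
        ddl.setLength(ddl.length() - 2); // drop trailing ", "
        return ddl.append(")").toString();
    }

    public static void main(String[] args) {
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("ID", "BIGINT");
        cols.put("NAME", "STRING");
        System.out.println(createIfNotExists("EXAMPLE", cols, "ID"));
    }
}
```

The resulting statement would then be executed once over Phoenix JDBC before the DataFrame is stored.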



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PHOENIX-2632) Easier Hive->Phoenix data movement

2016-01-28 Thread Josh Mahonin (JIRA)

[ https://issues.apache.org/jira/browse/PHOENIX-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121584#comment-15121584 ]

Josh Mahonin commented on PHOENIX-2632:
---

I think #1 is a great idea. I don't really have any opinions either way on #2.

It should be pretty straightforward to implement. Starting from [1], we just 
need to adjust the SaveMode case statement and add the code to create the 
table then and there. The various options you use in your config can be passed 
through from Spark as option parameters (e.g., zkUrl and table).

I had originally thought that 'Ignore' would be the right SaveMode to use, but 
looking through some examples, I'm wondering if we should take the approach 
where the default 'ErrorIfExists' attempts a 'CREATE TABLE', 'Ignore' does a 
'CREATE TABLE IF NOT EXISTS', and the existing 'Append' mode simply attempts 
to write straight to the specified table.
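The dispatch described above could be sketched roughly as follows. This is a hypothetical outline of the pre-write DDL decision (the real change would go in the phoenix-spark DefaultSource case statement), with illustrative names.

```java
// Hypothetical sketch of the SaveMode -> pre-write DDL decision discussed
// above; enum and method names are illustrative, not the actual Spark or
// phoenix-spark API.
public class SaveModeDdl {

    enum SaveMode { ERROR_IF_EXISTS, IGNORE, APPEND, OVERWRITE }

    // Returns the DDL to run before writing, or null when no DDL is needed.
    static String preWriteDdl(SaveMode mode, String table, String columnDefs) {
        switch (mode) {
            case ERROR_IF_EXISTS:
                // Plain CREATE TABLE: fails if the table already exists.
                return "CREATE TABLE " + table + " (" + columnDefs + ")";
            case IGNORE:
                // Create only when missing; an existing table is left alone.
                return "CREATE TABLE IF NOT EXISTS " + table + " (" + columnDefs + ")";
            case APPEND:
            case OVERWRITE:
                // Write straight to the (presumed existing) table.
                return null;
        }
        throw new IllegalStateException("Unreachable");
    }

    public static void main(String[] args) {
        System.out.println(
            preWriteDdl(SaveMode.IGNORE, "EXAMPLE", "ID BIGINT PRIMARY KEY"));
    }
}
```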

I'm probably a bit ahead of myself, once you open a new JIRA we can work out 
the details there.



[jira] [Commented] (PHOENIX-2632) Easier Hive->Phoenix data movement

2016-01-27 Thread Josh Elser (JIRA)

[ https://issues.apache.org/jira/browse/PHOENIX-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120348#comment-15120348 ]

Josh Elser commented on PHOENIX-2632:
-

[~rgelhau], how do you see something like this getting included into Phoenix? A 
new Maven module that can sit downstream of phoenix-spark, maybe produce some 
uber-jar to make classpath stuff easier from the Phoenix side?

Any possibility to add some end-to-end tests? Such tests would be nice to have 
to help catch future breakages as they happen, instead of discovering them 
after a release when someone goes to use it.

In general, any tooling that can help get your data into Phoenix seems like a 
valuable addition to me. There are many ways to hammer that nail, but this 
seems like it would be a reasonably general-purpose one to provide.



[jira] [Commented] (PHOENIX-2632) Easier Hive->Phoenix data movement

2016-01-27 Thread Josh Mahonin (JIRA)

[ https://issues.apache.org/jira/browse/PHOENIX-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120500#comment-15120500 ]

Josh Mahonin commented on PHOENIX-2632:
---

This looks pretty neat, [~rgelhau].

I bet your 'CREATE TABLE IF NOT EXISTS' functionality could be wrapped into 
the existing Spark DataFrame code and made to serve the SaveMode.Ignore option 
[1]. Right now it only supports SaveMode.Overwrite, which assumes the table is 
already set up.

Once that's in, I think the Hive->Phoenix functionality becomes a documentation 
exercise: show how to set up the Hive table as a DataFrame, then invoke 
df.save("org.apache.phoenix.spark"...) on it.

[1] http://spark.apache.org/docs/latest/sql-programming-guide.html





[jira] [Commented] (PHOENIX-2632) Easier Hive->Phoenix data movement

2016-01-27 Thread Randy Gelhausen (JIRA)

[ https://issues.apache.org/jira/browse/PHOENIX-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120717#comment-15120717 ]

Randy Gelhausen commented on PHOENIX-2632:
--

I would like to see this moved into Phoenix in two ways:

1. [~jmahonin] agreed the "create if not exists" snippet would improve the 
existing phoenix-spark API integration. I'll look at opening an additional JIRA 
and submitting a preliminary patch to add it there.

2. I also envision this as a new "executable" module, similar to the pre-built 
bulk CSV loading MR job:

  HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf \
  hadoop jar phoenix-4.0.0-incubating-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table EXAMPLE --input /data/example.csv

Making the generic "Hive table/query <-> Phoenix" use case bash-scriptable 
opens the door to users who aren't going to write Spark code just to move data 
back and forth between Hive and HBase.
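To sketch what such a bash-scriptable entry point would need on the tool side, here is a minimal --flag value argument parser of the kind the executable module would carry; the flag names are hypothetical, not an agreed interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: argument handling for an imagined
// "hadoop jar ... HiveToPhoenixTool --table ... --zkUrl ..." invocation,
// in the style of the CsvBulkLoadTool command above. Flag names are
// illustrative only.
public class ToolArgs {

    // Parse alternating "--flag value" pairs into an options map.
    static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i += 2) {
            if (!args[i].startsWith("--")) {
                throw new IllegalArgumentException("Expected --flag, got: " + args[i]);
            }
            opts.put(args[i].substring(2), args[i + 1]);
        }
        return opts;
    }

    public static void main(String[] args) {
        Map<String, String> opts =
            parse(new String[] {"--table", "EXAMPLE", "--zkUrl", "localhost:2181"});
        System.out.println(opts.get("table"));
    }
}
```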

[~elserj] [~jmahonin] I'm happy to add tests and restructure the existing code 
for both 1 and 2, but will need some guidance once you decide yea or nay for 
each.
