[jira] [Commented] (PHOENIX-2632) Easier Hive->Phoenix data movement
[ https://issues.apache.org/jira/browse/PHOENIX-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122106#comment-15122106 ]

Josh Mahonin commented on PHOENIX-2632:
---------------------------------------

Whoops, attaching the link I forgot previously:

[1] https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DefaultSource.scala#L40-L44

> Easier Hive->Phoenix data movement
> ----------------------------------
>
>         Key: PHOENIX-2632
>         URL: https://issues.apache.org/jira/browse/PHOENIX-2632
>     Project: Phoenix
>  Issue Type: Improvement
>    Reporter: Randy Gelhausen
>
> Moving tables or query results from Hive into Phoenix today requires
> error-prone manual schema re-definition inside HBase storage handler
> properties. Since Hive and Phoenix support near-equivalent types, it should
> be easier for users to pick a Hive table and load it (or query results
> derived from it).
> I'm posting this to open design discussion, but also to submit my own
> project, https://github.com/randerzander/HiveToPhoenix, for consideration
> as an early solution. It creates a Spark DataFrame from a Hive query, uses
> Phoenix JDBC to "create if not exists" a Phoenix-equivalent table, and uses
> the phoenix-spark artifact to store the DataFrame into Phoenix.
> I'm eager for feedback on whether this is interesting/useful to the Phoenix
> community.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PHOENIX-2632) Easier Hive->Phoenix data movement
[ https://issues.apache.org/jira/browse/PHOENIX-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121584#comment-15121584 ]

Josh Mahonin commented on PHOENIX-2632:
---------------------------------------

I think #1 is a great idea. I don't really have an opinion either way on #2.

It should be pretty straightforward to implement. Starting from [1], we just need to adjust the SaveMode case statement and add the code to create the table then and there. The various options you use in your config can be passed through from Spark as option parameters (e.g., zkUrl and table).

I had originally thought that 'Ignore' would be the right SaveMode to use, but looking through some examples, I'm wondering if we should take the approach where the default 'ErrorIfExists' attempts a 'CREATE TABLE', 'Ignore' does a 'CREATE TABLE IF NOT EXISTS', and the existing 'Append' mode just attempts to write straight to the specified table.

I'm probably getting a bit ahead of myself; once you open a new JIRA we can work out the details there.
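To make the proposed mode semantics concrete, here is a small sketch of the SaveMode-to-DDL mapping described above. This is illustrative Python, not the actual phoenix-spark DefaultSource code; the helper name and signature are invented for the example:

```python
# Sketch of the proposed SaveMode handling: which DDL statement (if any)
# would be issued before writing the DataFrame. Hypothetical helper, not
# phoenix-spark's actual implementation.
from typing import Optional

def ddl_for_save_mode(mode: str, table: str, columns_ddl: str) -> Optional[str]:
    """Return the DDL to run before writing, or None when no DDL is needed."""
    if mode == "ErrorIfExists":
        # Plain CREATE TABLE: fails if the table already exists.
        return f"CREATE TABLE {table} ({columns_ddl})"
    if mode == "Ignore":
        # CREATE TABLE IF NOT EXISTS: silently reuses an existing table.
        return f"CREATE TABLE IF NOT EXISTS {table} ({columns_ddl})"
    if mode == "Append":
        # Write straight to the specified table; no DDL at all.
        return None
    raise ValueError(f"unsupported SaveMode: {mode}")

print(ddl_for_save_mode("Ignore", "EXAMPLE", "ID BIGINT NOT NULL PRIMARY KEY"))
```

The real change would live in the Scala case statement in DefaultSource.scala, but the decision table is the same.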
[jira] [Commented] (PHOENIX-2632) Easier Hive->Phoenix data movement
[ https://issues.apache.org/jira/browse/PHOENIX-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120348#comment-15120348 ]

Josh Elser commented on PHOENIX-2632:
-------------------------------------

[~rgelhau], how do you see something like this getting included into Phoenix? A new Maven module that sits downstream of phoenix-spark, maybe producing an uber-jar to make classpath setup easier on the Phoenix side?

Any possibility of adding some end-to-end tests? They would be nice to have, to catch future breakages as they happen instead of realizing after a release when someone goes to use it.

In general, any tooling that helps get your data into Phoenix seems like a valuable addition to me. There are many ways to hammer that nail, but this seems like it would be a reasonably general-purpose one to provide.
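For the sake of discussion, one possible shape for such a module is sketched below. The artifactId is a placeholder and nothing here is an agreed design; it only illustrates "downstream of phoenix-spark, shaded into an uber-jar":

```xml
<!-- Hypothetical pom.xml fragment for a new downstream module.
     The artifactId is a placeholder, not an agreed name. -->
<project>
  <artifactId>phoenix-hive-loader</artifactId>
  <dependencies>
    <dependency>
      <groupId>org.apache.phoenix</groupId>
      <artifactId>phoenix-spark</artifactId>
      <version>${project.version}</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <!-- maven-shade-plugin builds the uber-jar so users get a single
           artifact on the classpath -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
      </plugin>
    </plugins>
  </build>
</project>
```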
[jira] [Commented] (PHOENIX-2632) Easier Hive->Phoenix data movement
[ https://issues.apache.org/jira/browse/PHOENIX-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120500#comment-15120500 ]

Josh Mahonin commented on PHOENIX-2632:
---------------------------------------

This looks pretty neat, [~rgelhau]. I bet your 'CREATE TABLE IF NOT EXISTS' functionality could be wrapped into the existing Spark DataFrame code and used for the SaveMode.Ignore option [1]. Right now it only supports SaveMode.Overwrite, which assumes the table is already set up.

Once that's in, I think the Hive->Phoenix functionality becomes a documentation exercise: show how to set up the Hive table as a DataFrame, then invoke df.save("org.apache.phoenix.spark"...) on it.

[1] http://spark.apache.org/docs/latest/sql-programming-guide.html
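The 'CREATE TABLE IF NOT EXISTS' step being discussed amounts to deriving Phoenix DDL from the source schema. A sketch of that derivation, in Python for brevity; the Hive-to-Phoenix type mapping shown is a partial, assumed correspondence, not HiveToPhoenix's actual code:

```python
# Illustrative sketch: generate a Phoenix CREATE TABLE IF NOT EXISTS
# statement from a Hive-style schema. The type mapping is a partial,
# assumed correspondence for the example only.
HIVE_TO_PHOENIX = {
    "string": "VARCHAR",
    "int": "INTEGER",
    "bigint": "BIGINT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
    "timestamp": "TIMESTAMP",
}

def create_if_not_exists_ddl(table, schema, pk):
    """schema: list of (column_name, hive_type) pairs; pk: primary key column."""
    cols = []
    for name, hive_type in schema:
        phoenix_type = HIVE_TO_PHOENIX[hive_type.lower()]
        # Phoenix requires an explicit primary key on the row key column.
        suffix = " NOT NULL PRIMARY KEY" if name == pk else ""
        cols.append(f"{name} {phoenix_type}{suffix}")
    return f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})"

ddl = create_if_not_exists_ddl("EXAMPLE", [("id", "bigint"), ("name", "string")], "id")
print(ddl)
# -> CREATE TABLE IF NOT EXISTS EXAMPLE (id BIGINT NOT NULL PRIMARY KEY, name VARCHAR)
```

Executing the resulting DDL through Phoenix JDBC before the DataFrame save is what makes the Ignore mode safe to re-run.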
[jira] [Commented] (PHOENIX-2632) Easier Hive->Phoenix data movement
[ https://issues.apache.org/jira/browse/PHOENIX-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120717#comment-15120717 ]

Randy Gelhausen commented on PHOENIX-2632:
------------------------------------------

I would like to see this moved into Phoenix in two ways:

1. [~jmahonin] agreed the "create if not exists" snippet would improve the existing phoenix-spark API integration. I'll look at opening an additional JIRA and submitting a preliminary patch to add it there.

2. I also envision this as a new "executable" module, similar to the pre-built bulk CSV loading MR job:

   HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar phoenix-4.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv

Making the generic "Hive table/query <-> Phoenix" use case bash-scriptable opens the door to users who aren't going to write Spark code just to move data back and forth between Hive and HBase.

[~elserj] [~jmahonin] I'm happy to add tests and restructure the existing code for both 1 and 2, but I will need some guidance once you decide yea or nay on each.