On Fri, May 15, 2009 at 5:05 PM, Aaron Kimball <aa...@cloudera.com> wrote:
> Hi all,
>
> For the database import tool I'm writing (Sqoop; HADOOP-5815), in addition
> to uploading data into HDFS and using MapReduce to load/transform the data,
> I'd like to integrate more closely with Hive. Specifically, to run the
> CREATE TABLE statements needed to automatically inject table definitions into
> Hive's metastore for the data files that sqoop loads into HDFS. Doing this
> requires linking against Hive in some way (either directly by using one of
> their API libraries, or "loosely" by piping commands into a Hive instance).
>
> In either case, there's a dependency there. I was hoping someone on this
> list with more Ivy experience than I knows what's the best way to make this
> happen. Hive isn't in the maven2 repository that Hadoop pulls most of its
> dependencies from. It might be necessary for sqoop to have access to a full
> build of Hive. It doesn't seem like a good idea to check that binary
> distribution into Hadoop svn, but I'm not sure what's the most expedient
> alternative. Is it acceptable to just require that developers who wish to
> compile/test/run sqoop have a separate standalone Hive deployment and a
> proper HIVE_HOME variable? This would keep our source repo "clean." The
> downside here is that it makes it difficult to test Hive-specific
> integration functionality with Hudson and requires extra leg-work of
> developers.
>
> Thanks,
> - Aaron Kimball
>
Aaron,

I have a similar situation. I am using the GPL geo-ip library as a Hive UDF. Due to Apache/GPL licensing issues, the code would not be compatible. Currently my build process references all of the Hive lib/*.jar files. It does not really need all of them, but since I am not sure exactly which ones I need, I reference them all.

I was thinking one option is to use Git. That way I can integrate my patch into my forked Hive.

I see your problem, though. You have a few Hive entry points:

1) JDBC
2) Hive Thrift Server
3) scripting
4) Java API

JDBC and Thrift should be the lightest, in that a few jar files would make up the entry point rather than the entire Hive distribution.

Although now that Hive has had two releases, maybe Hive should be in Maven. With that, Hive could be an optional or a mandatory ant target for sqoop.
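
For what it's worth, here is a rough sketch of the "scripting" entry point combined with the HIVE_HOME requirement from the original mail. It only builds a CREATE TABLE statement and hands it to the Hive command-line client with `hive -e`. The class name HiveCliSketch, the table name and the column list are all made up for illustration, and this is not how Sqoop actually does it; treat it as one possible shape for the loose coupling, nothing more.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Sqoop or Hive.
class HiveCliSketch {

    /** Builds a CREATE TABLE statement from an ordered column-name to Hive-type map. */
    static String buildCreateTable(String table, Map<String, String> columns) {
        StringBuilder sb = new StringBuilder("CREATE TABLE ").append(table).append(" (");
        boolean first = true;
        for (Map.Entry<String, String> col : columns.entrySet()) {
            if (!first) {
                sb.append(", ");
            }
            sb.append(col.getKey()).append(' ').append(col.getValue());
            first = false;
        }
        return sb.append(")").toString();
    }

    /** Builds the argument list for: $HIVE_HOME/bin/hive -e "<statement>". */
    static List<String> buildCommand(String hiveHome, String statement) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(new File(new File(hiveHome, "bin"), "hive").getPath());
        cmd.add("-e");
        cmd.add(statement);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        // Assumed example schema; a real tool would derive this from the source database.
        Map<String, String> columns = new LinkedHashMap<String, String>();
        columns.put("id", "INT");
        columns.put("name", "STRING");
        String ddl = buildCreateTable("employees", columns);

        String hiveHome = System.getenv("HIVE_HOME");
        if (hiveHome == null) {
            // No standalone Hive deployment configured; just show the statement.
            System.out.println("HIVE_HOME is not set; would have run: " + ddl);
            return;
        }

        // Run the Hive CLI as a child process and pass its output through.
        Process p = new ProcessBuilder(buildCommand(hiveHome, ddl)).inheritIO().start();
        System.exit(p.waitFor());
    }
}
```

The upside of doing it this way is that nothing from Hive has to be on the compile classpath, so the Ivy/Maven question goes away for the build itself. The downside is the one Aaron already pointed out: anything that exercises it for real (Hudson included) needs a working Hive install behind HIVE_HOME.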