New reply on DataCleaner's online discussion forum (http://datacleaner.org/forum):
Kishore Veleti replied to subject 'How to connect to Hive using Data Cleaner?'
-------------------
Hi Kasper,

Thanks for your reply. I was planning to try it but was busy yesterday, so I posted the question first. Today I tried it, and here is my experience.

Overall I was able to achieve part of it: connect to the Hive database, list databases and tables, and run Quick Analysis. I like the DataCleaner product and thought I would share my experience with the Hive and DataCleaner integration. At the end of the day I could not add a table column to a job on the job configuration page of the desktop application and continue further.

Here comes my day with the DataCleaner desktop application + Hive integration.

Point 1 : What version of Hive JDBC I am using - 0.13

Point 2 : Fact - the Hive JDBC jar file depends on other Hive jar files like hive-service, hive-cli etc.

Point 3 : Registering the Hive JDBC driver
============================
In DataCleaner I registered the Hive driver as “Other JDBC driver” by specifying the Hive driver class name (org.apache.hadoop.hive.jdbc.HiveDriver) and the hive-jdbc jar file location. DataCleaner registered the driver without any issue!

Point 4 : I could not use the above Hive driver registration immediately and found out that I have to restart the DataCleaner desktop application. So I restarted it.

Point 5 : Create a Hive data store (eventually FAILED though, check my notes below)
====================================================================
On the DataCleaner desktop app home page, where we can register a data source, I clicked on the “More” button and selected the “Other” option. In the window I named the source “Local Host Hive Source” and specified the Hive JDBC driver class name and the JDBC URI (DO NOT select any value in the drop-down where you will find MongoDB, SugarCRM and other data types).

Click on Test Connection. FAILED!!!!

Got an error because the Hive JDBC driver depends on other classes which are not part of the Hive JDBC jar file.
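For anyone who wants to see this kind of failure outside DataCleaner, here is a minimal sketch, assuming a Java 8+ runtime. It tries to load a driver class from a given jar and prints what is missing. The class name DriverLoadCheck and the helper tryLoad are made up for illustration and are not part of DataCleaner or Hive; the jar path and driver class name are passed on the command line.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;

class DriverLoadCheck {
    /** Returns null if the class loads and initializes, otherwise a short failure description. */
    static String tryLoad(String className, URL[] jars) {
        // Classes are looked up on the existing classpath first, then in the given jars.
        try (URLClassLoader loader = new URLClassLoader(jars)) {
            Class.forName(className, true, loader);
            return null;
        } catch (ClassNotFoundException | LinkageError | IOException e) {
            // NoClassDefFoundError (a LinkageError) usually names the missing dependency class.
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] = path to the driver jar, args[1] = driver class name (both supplied by you)
        URL jar = new File(args[0]).toURI().toURL();
        String problem = tryLoad(args[1], new URL[] { jar });
        System.out.println(problem == null ? "loaded OK" : problem);
    }
}
```

Loading the driver class is only a first check. Some dependencies may only be needed when a connection is actually opened, so a clean result here does not guarantee that Test Connection will pass.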
I placed the dependent Hive jar files in the “lib” directory of the DataCleaner desktop app and restarted the desktop app. It did not work.

Later I found that the “datacleaner.sh” desktop app startup script uses an executable jar file (DataCleaner.jar) with “lib/…” listed in the jar's META-INF manifest.

Just for testing purposes I removed the “-jar” option in datacleaner.sh, added a classpath and specified the exact main class in datacleaner.sh.

Somehow the above “-jar” removal did not work either. I found out that “DataCleaner-DesktopApp.jar” also has the “executable jar” behaviour, with its jar files specified inside “DataCleaner-DesktopApp.jar”. I assumed this could be the problem and decided not to proceed with playing around with the datacleaner.sh file.

I tried a different approach: I downloaded the DataCleaner source code, browsed through it for some ideas and finally tried the approach below, which WORKED!!!

Point 6 : How I resolved Hive dependent jar files issue and created a Hive data store (eventually FAILED again in creating the Hive data store)
====================================================================
I decided to create a “maven-shaded” Hive jar that includes all of its dependent classes in a single jar file.

Created a Maven project, specified the hive-jdbc and hadoop-common Maven dependencies and built a single jar file containing all the classes of hive-jdbc and its dependent jars.

Closed the DataCleaner desktop app.

Deleted the userpreferences.dat file inside the DataCleaner installation (I felt that the old Hive JDBC driver registration I did above might create issues. I might be wrong here; I deleted the userpreferences.dat file just to be on the safer side).

Started the DataCleaner desktop app again.

Registered the Hive JDBC driver by specifying the location of the custom hive-jdbc-with-dependencies.jar file.

Click on Test Connection. FAILED!!!!
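A side note for anyone repeating the step above: a jar merged from several dependency jars can carry over signature files (*.SF, *.DSA, *.RSA) under META-INF from signed dependencies, and the JVM may then refuse to load classes from the merged jar. The sketch below, assuming Java 8+, lists such entries so the jar can be checked before registering it. The class name JarSignatureScan and its methods are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Locale;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

class JarSignatureScan {
    /** Lists entries directly under META-INF that look like signature files. */
    static List<String> signatureEntries(File jar) throws IOException {
        List<String> found = new ArrayList<>();
        // verify=false: only entry names are read, so a stale signature cannot abort the scan.
        try (JarFile jf = new JarFile(jar, false)) {
            Enumeration<JarEntry> entries = jf.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (isSignatureFile(name)) {
                    found.add(name);
                }
            }
        }
        return found;
    }

    /** True for names such as META-INF/FOO.SF, META-INF/FOO.DSA or META-INF/FOO.RSA. */
    static boolean isSignatureFile(String entryName) {
        String upper = entryName.toUpperCase(Locale.ROOT);
        String prefix = "META-INF/";
        return upper.startsWith(prefix)
                && upper.indexOf('/', prefix.length()) < 0 // no further sub-directory
                && (upper.endsWith(".SF") || upper.endsWith(".DSA") || upper.endsWith(".RSA"));
    }

    public static void main(String[] args) throws IOException {
        // args[0] = path to the merged jar, e.g. the hive-jdbc-with-dependencies.jar built above
        for (String name : signatureEntries(new File(args[0]))) {
            System.out.println(name);
        }
    }
}
```

If the list is not empty, those files no longer match the merged content, which is one possible reason for a merged jar failing to load.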
Point 7 : How I resolved Hive dependent jar files issue and created a Hive data store (eventually SUCCESS in creating the Hive data store) 😊
====================================================================
Found that the above hive-jdbc-with-dependencies.jar file has so many META-INF files which are inherited from the other hive-jdbc dependency jar files.

Googled and found that the above META-INF issue is common behaviour with Maven jars built with dependencies, and that we need to add the configuration below to the Maven plugin that builds the jar with dependencies:

    <configuration>
      <filters>
        <filter>
          <artifact>*:*</artifact>
          <excludes>
            <exclude>META-INF/*.SF</exclude>
            <exclude>META-INF/*.DSA</exclude>
            <exclude>META-INF/*.RSA</exclude>
          </excludes>
        </filter>
      </filters>
    </configuration>

(Please don't ask me why these filters; I just used whatever I found in my Googling.)

Recreated the hive-jdbc-with-dependencies.jar file and followed the same steps as detailed above, starting with deleting userpreferences.dat.

Click on Test Connection. SUCCESS!!!!!!!!! 😊

I am able to see the Hive databases and tables.

I clicked on a table and selected "Quick Analysis". At first I did not understand what was happening because nothing was displayed in the DataCleaner desktop app. Later I found that a Hive MapReduce job was executing in the backend, and FINALLY the RESULTS were shown. SUCCESS again!!! 😊

Then I tried to add a column to the job, and it failed. Looking at the exception stack trace, I found that DataCleaner internally uses Apache MetaModel, and it is trying to call getIndexes for the selected column, which I think is not supported for Hive. So I could not proceed further, but at least I got the Quick Analysis.

Since I cannot do anything further for now, I will stop here.

The above is what I did, and I might be wrong in some of my analysis or thinking. I like the DataCleaner product, which is why I spent more time than planned on integrating it with Hive.

Thanks,
Kishore Veleti A.V.K.
-------------------
View the topic online to reply - go to http://datacleaner.org/topic/1044/How-to-connect-to-Hive-using-Data-Cleaner%3F
