New reply on DataCleaner's online discussion forum 
(http://datacleaner.org/forum):

Kishore Veleti replied to subject 'How to connect to Hive using Data Cleaner?'

-------------------

Hi Kasper,

Thanks for your reply. I was planning to try it myself but was busy yesterday, 
so I posted the question.

Today I tried it and here is my experience. Overall I was able to achieve part 
of it, i.e. connect to the Hive database, list databases and tables, and run 
Quick Analysis.

I like the DataCleaner product and thought I would share my experience with 
Hive and DataCleaner integration.

At the end of the day, though, I could not add a table column to a job on the 
job configuration page of the desktop application and continue further.

Here is my day with the DataCleaner desktop application + Hive integration:

Point 1 : Which version of Hive JDBC I am using - 0.13

Point 2 : Fact - the Hive JDBC jar file depends on other Hive jar files like 
hive-service, hive-cli etc.

Point 3 : Registering the Hive JDBC driver
============================
In DataCleaner I registered the Hive driver as “Other JDBC driver” by 
specifying the Hive driver class name (org.apache.hadoop.hive.jdbc.HiveDriver) 
and the hive-jdbc jar file location.

DataCleaner successfully registered the driver without any issue!

Point 4 : I could not use the Hive driver registered above immediately and 
found out that I had to restart the DataCleaner desktop application.

So I restarted the DataCleaner desktop application.

Point 5 : Create a Hive data store (eventually FAILED though, check my notes 
below)
====================================================================

On the home page of the DataCleaner desktop app, where we can register a data 
source, I clicked on the “More” button and selected the “Other” option.

In the window I named the source “Local Host Hive Source” and specified the 
Hive JDBC driver class name and the JDBC URI (DO NOT select any value in the 
drop-down where you will find MongoDB, SugarCRM and other data types).

Click on Test Connection.

FAILED!!!!

Got an error because the Hive JDBC driver depends on other classes which are 
not part of the Hive JDBC jar file.
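
The failure above can be checked outside DataCleaner. The following is a minimal sketch, assuming the driver is loaded from the single registered jar through its own class loader (how DataCleaner actually loads drivers is not confirmed here). The class name DriverLoadCheck and its tryLoad method are hypothetical helpers made up for this illustration. A missing dependency may also only surface later, when a connection is opened, so a successful load does not prove the jar is complete.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.nio.file.Paths;

class DriverLoadCheck {

    /**
     * Tries to load and initialize driverClassName using only the given jar
     * (plus the classes already visible to this program).
     * Returns null on success, otherwise a short description of the failure.
     */
    static String tryLoad(Path jar, String driverClassName) {
        try (URLClassLoader loader = new URLClassLoader(
                new URL[] { jar.toUri().toURL() },
                DriverLoadCheck.class.getClassLoader())) {
            Class.forName(driverClassName, true, loader);
            return null;
        } catch (ClassNotFoundException | LinkageError | IOException e) {
            // NoClassDefFoundError (a LinkageError) is the typical symptom when the
            // driver class itself is found but one of its dependencies is not.
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Usage: java DriverLoadCheck.java <path-to-jar> <driver-class-name>
        String problem = tryLoad(Paths.get(args[0]), args[1]);
        System.out.println(problem == null ? "Driver class loaded" : "Failed: " + problem);
    }
}
```

Running it against the plain hive-jdbc jar and then against a jar that bundles the dependencies should show whether the difference is a class loading problem.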

Placed those jar files in the “lib” directory of the DataCleaner desktop app 
and restarted the desktop app. It did not work.

Later I found that the “datacleaner.sh” desktop app startup script uses an 
executable jar file (DataCleaner.jar) with “lib/…” entries added in the jar’s 
META-INF manifest file.

Just for testing purposes, I removed the “-jar” option in datacleaner.sh, added 
a classpath and specified the exact main class in datacleaner.sh.

Somehow the “-jar” removal did not work either. I found out that 
"DataCleaner-DesktopApp.jar" also has “executable jar” behaviour, with jar 
files specified inside "DataCleaner-DesktopApp.jar". I assumed this could be 
the problem.
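
Some background that may explain this: when Java is started with “-jar”, the class path comes from the Class-Path attribute in the jar's manifest, and any -cp option is ignored. Extra jars dropped into “lib” are therefore only picked up if the manifest lists them. Below is a small sketch for inspecting that attribute. ManifestClassPath and its helper methods are hypothetical names chosen for this example, and the jar to inspect is passed on the command line.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

class ManifestClassPath {

    /** Splits the manifest's Class-Path attribute into its space-separated entries. */
    static List<String> classPathEntries(Manifest manifest) {
        String value = manifest.getMainAttributes().getValue(Attributes.Name.CLASS_PATH);
        if (value == null || value.trim().isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(value.trim().split("\\s+"));
    }

    /** Parses manifest text; handy for trying the method without a jar file. */
    static Manifest parse(String manifestText) {
        try {
            return new Manifest(
                    new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ManifestClassPath.java <path-to-executable-jar>
        try (JarFile jar = new JarFile(args[0])) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                System.out.println("No manifest found");
                return;
            }
            System.out.println("Main-Class: "
                    + manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS));
            for (String entry : classPathEntries(manifest)) {
                System.out.println("Class-Path entry: " + entry);
            }
        }
    }
}
```

If the Hive jars do not appear among the printed entries, copying them into “lib” alone would not put them on the class path.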

Decided not to proceed with playing around with the datacleaner.sh file.

Tried a different approach: for this I downloaded the DataCleaner source code, 
browsed through it for some ideas and finally tried the approach below, which 
WORKED!!!

Point 6 : How I resolved Hive dependent jar files issue and created a Hive data 
store (eventually FAILED again in creating the Hive data store)
====================================================================

I decided to create a Maven-shaded Hive JDBC jar, i.e. a single jar file that 
includes all classes of hive-jdbc and of its dependent jar files.

Created a Maven project, specified the hive-jdbc and hadoop-common Maven 
dependencies and created a single jar file containing all classes of hive-jdbc 
and its dependent jars.

Closed the DataCleaner desktop app.

Deleted the userpreferences.dat file inside the DataCleaner installation (I 
felt that the old Hive JDBC driver registration which I did above might create 
issues. I might be wrong here, but just to be on the safer side I deleted the 
userpreferences.dat file).

Started the DataCleaner desktop app again.

Registered the Hive JDBC driver by specifying the custom 
hive-jdbc-with-dependencies.jar file location.

Click on Test Connection.


FAILED!!!!

Point 7 : How I resolved Hive dependent jar files issue and created a Hive data 
store (eventually SUCCESS in creating the Hive data store) 😊
====================================================================
Found that the hive-jdbc-with-dependencies.jar file above has many META-INF 
entries which are inherited from the other hive-jdbc dependency jar files.

Googled and found that this META-INF issue is common behaviour with Maven jars 
built with dependencies, and that we need to add the configuration below to 
the Maven plugin that builds the jar (the filter syntax is that of the 
maven-shade-plugin):

<configuration>
  <filters>
    <filter>
      <artifact>*:*</artifact>
      <excludes>
        <exclude>META-INF/*.SF</exclude>
        <exclude>META-INF/*.DSA</exclude>
        <exclude>META-INF/*.RSA</exclude>
      </excludes>
    </filter>
  </filters>
</configuration>


(please don't ask me why these filters; I just used whatever I found while 
Googling)
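
A likely reason these filters help: .SF, .DSA and .RSA files under META-INF are jar signature files. When signed dependencies are merged into one jar, the copied signatures no longer match the merged content, and the JVM typically rejects the jar with a SecurityException about an invalid signature file digest. The sketch below lists such entries in a jar so the before and after builds can be compared. SignatureFileScan and isSignatureFile are hypothetical names for this example, and the matching rule is only my reading of the three exclude patterns.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.Locale;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

class SignatureFileScan {

    /** True for entries matching META-INF/*.SF, META-INF/*.DSA or META-INF/*.RSA. */
    static boolean isSignatureFile(String entryName) {
        String upper = entryName.toUpperCase(Locale.ROOT);
        if (!upper.startsWith("META-INF/")) {
            return false;
        }
        String rest = upper.substring("META-INF/".length());
        if (rest.contains("/")) {
            return false; // only files directly inside META-INF count
        }
        return rest.endsWith(".SF") || rest.endsWith(".DSA") || rest.endsWith(".RSA");
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SignatureFileScan.java <path-to-jar>
        // verify=false: open the jar without checking its signatures.
        try (JarFile jar = new JarFile(args[0], false)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (isSignatureFile(name)) {
                    System.out.println("Signature file: " + name);
                }
            }
        }
    }
}
```

A jar built with the excludes in place should print nothing.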

Recreated the hive-jdbc-with-dependencies.jar file and followed the same steps 
as detailed above, starting with deleting userpreferences.dat.

Click on Test Connection.


SUCCESS!!!!!!!!!

😊

I was able to see the Hive databases and tables.

I clicked on a table and selected "Quick Analysis". At first I did not 
understand what was happening because nothing was displayed in the DataCleaner 
desktop app.

Later I found that a Hive MapReduce job was executing in the background, and 
FINALLY the RESULTS were shown.

SUCCESS again !!!
😊
 
Next I tried to add a column to the job, and it failed. Looking at the 
exception stack trace, I found that DataCleaner internally uses Apache 
MetaModel, which tries to get the indexes (getIndexes) of the selected column, 
and I think this is not supported for Hive.
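
To illustrate the kind of failure described above: in plain JDBC, index metadata is requested through DatabaseMetaData.getIndexInfo, and a driver without index support can answer with an SQLException. The sketch below shows a defensive lookup that treats such an exception as "no index information". It is not MetaModel's code. IndexLookup, indexedColumns and unsupportedMetaData are hypothetical names, and the stand-in metadata object only imitates a driver that rejects the call; it makes no claim about what the Hive driver really does.

```java
import java.lang.reflect.Proxy;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class IndexLookup {

    /** Returns indexed column names, or an empty list if the driver cannot provide them. */
    static List<String> indexedColumns(DatabaseMetaData metaData, String table) {
        List<String> columns = new ArrayList<>();
        try (ResultSet rs = metaData.getIndexInfo(null, null, table, false, true)) {
            while (rs.next()) {
                String column = rs.getString("COLUMN_NAME");
                if (column != null) {
                    columns.add(column);
                }
            }
        } catch (SQLException e) {
            // Also covers SQLFeatureNotSupportedException, a subclass of SQLException.
            return new ArrayList<>();
        }
        return columns;
    }

    /** Stand-in for a driver without index support: every metadata call throws. */
    static DatabaseMetaData unsupportedMetaData() {
        return (DatabaseMetaData) Proxy.newProxyInstance(
                IndexLookup.class.getClassLoader(),
                new Class<?>[] { DatabaseMetaData.class },
                (proxy, method, methodArgs) -> {
                    throw new SQLException("Method not supported");
                });
    }

    public static void main(String[] args) {
        // With the stand-in, the lookup falls back to an empty list instead of failing.
        System.out.println(indexedColumns(unsupportedMetaData(), "some_table"));
    }
}
```

With a real connection, connection.getMetaData() would be passed in place of the stand-in.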

So I could not proceed further, but at least I got the Quick Analysis. Since I 
cannot do anything further for now, I will stop here.

The above is what I did, and I might be wrong in some of my analysis or 
thinking. I like the DataCleaner product, which is why I spent more time than 
planned on integrating it with Hive.

Thanks,
Kishore Veleti A.V.K.

-------------------

View the topic online to reply - go to 
http://datacleaner.org/topic/1044/How-to-connect-to-Hive-using-Data-Cleaner%3F
