[ https://issues.apache.org/jira/browse/ATLAS-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333127#comment-15333127 ]
Hemanth Yamijala commented on ATLAS-904:
----------------------------------------
[~suma.shivaprasad], I don't know the full details of the code, so please
take my review comments with a pinch of salt.
This patch covers two changes:
* ATLAS-877 - Here we switched from using the DDL time to using the create
time. This change seems right to me. +1.
* ATLAS-904 - What we have essentially done is that, instead of fixing the bug
reported here, we have removed the cause of the bug, i.e., the normalize method
itself.
So, let's focus on the removal of the normalization. From what I understand,
normalization was introduced because Atlas currently does not model
partition-level lineage. By removing literals from queries that involve the
same two sets of DataSets (inputs and outputs), we were 'normalizing'
partition-level changes so that they look like table-level changes. Further, we
were capturing the most recent query that ran on this set. (It appears that
this was an array of latest queries, but I don't know whether we were appending
to the array or replacing it, in which case we would capture only the latest
query.)
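
To illustrate what "removing literals" means here, below is a minimal sketch. It is an assumption-laden stand-in and not the Atlas code: the removed normalize method went through HiveASTRewriter (as the stack trace in the issue shows), whereas this sketch uses two simple regular expressions. The class name LiteralMaskSketch and the method maskLiterals are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; the real normalization rewrote the Hive AST.
final class LiteralMaskSketch {
    // Single-quoted string literals, e.g. '2016-06-15' (no escape handling).
    private static final Pattern STRING_LITERAL = Pattern.compile("'[^']*'");
    // Standalone integer or decimal literals, e.g. 42 or 3.14.
    private static final Pattern NUMBER_LITERAL = Pattern.compile("\\b\\d+(\\.\\d+)?\\b");

    /** Replaces every string and numeric literal in the query with '?'. */
    static String maskLiterals(String query) {
        String masked = STRING_LITERAL.matcher(query).replaceAll("?");
        return NUMBER_LITERAL.matcher(masked).replaceAll("?");
    }
}
```

With this sketch, two queries that differ only in the partition value (say dt='2016-06-14' versus dt='2016-06-15') mask to the same string, which is the sense in which partition-level changes collapse into one table-level process.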
I think that until we support partition-level lineage, sticking to the above
model is useful. If normalization is costly, as the Hive SMEs seem to be
telling us, can we just make the process name very generic, capturing {set of
inputs} -> {set of outputs} in sorted order of input and output names? We could
still store the actual (unnormalized) query in the array of latest queries (I
would prefer that this be a bounded array, say the last 100 entries or a
configurable number).
I believe this is a more usable solution than showing all the original
unnormalized queries, which could be very large for all we know. Please
let me know if this makes sense.
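
To make the proposal concrete, here is a rough sketch of the two pieces: a generic process name built from the sorted input and output names, and a bounded list of the most recent raw queries. Everything in it (the class ProcessNameSketch, the method processName, the nested RecentQueries class, the "," and " -> " separators) is hypothetical and not taken from the Atlas code base; the bound would presumably come from configuration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of the proposal; none of these names exist in Atlas.
final class ProcessNameSketch {

    /** Builds "in1,in2 -> out1" with each side sorted and de-duplicated. */
    static String processName(Collection<String> inputs, Collection<String> outputs) {
        return String.join(",", new TreeSet<>(inputs))
                + " -> "
                + String.join(",", new TreeSet<>(outputs));
    }

    /** Keeps only the most recent maxEntries queries, oldest first. */
    static final class RecentQueries {
        private final int maxEntries;
        private final Deque<String> queries = new ArrayDeque<>();

        RecentQueries(int maxEntries) {
            if (maxEntries <= 0) {
                throw new IllegalArgumentException("maxEntries must be positive");
            }
            this.maxEntries = maxEntries;
        }

        /** Appends a raw (unnormalized) query, evicting the oldest when full. */
        void add(String query) {
            if (queries.size() == maxEntries) {
                queries.removeFirst();
            }
            queries.addLast(query);
        }

        /** Returns a copy of the retained queries, oldest first. */
        List<String> snapshot() {
            return new ArrayList<>(queries);
        }
    }
}
```

Because the name depends only on the sorted sets, processName(["db.t2", "db.t1"], ["db.out"]) and processName(["db.t1", "db.t2"], ["db.out"]) both give "db.t1,db.t2 -> db.out", so no query rewriting is needed to group runs over the same tables.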
> Hive hook fails due to session state not being set
> --------------------------------------------------
>
> Key: ATLAS-904
> URL: https://issues.apache.org/jira/browse/ATLAS-904
> Project: Atlas
> Issue Type: Bug
> Affects Versions: 0.7-incubating
> Reporter: Suma Shivaprasad
> Assignee: Suma Shivaprasad
> Priority: Blocker
> Fix For: 0.7-incubating
>
> Attachments: ATLAS-904.1.patch, ATLAS-904.patch
>
>
> {noformat}
> 2016-06-15 11:34:30,423 WARN [Atlas Logger 0]: hook.HiveHook (HiveHook.java:normalize(557)) - Could not rewrite query due to error. Proceeding with original query EXPORT TABLE test_export_table to 'hdfs://localhost:9000/hive_tables/test_path1'
> java.lang.NullPointerException: Conf non-local session path expected to be non-null
> at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
> at org.apache.hadoop.hive.ql.session.SessionState.getHDFSSessionPath(SessionState.java:641)
> at org.apache.hadoop.hive.ql.Context.<init>(Context.java:133)
> at org.apache.hadoop.hive.ql.Context.<init>(Context.java:120)
> at org.apache.atlas.hive.rewrite.HiveASTRewriter.<init>(HiveASTRewriter.java:44)
> at org.apache.atlas.hive.hook.HiveHook.normalize(HiveHook.java:554)
> at org.apache.atlas.hive.hook.HiveHook.getProcessReferenceable(HiveHook.java:702)
> at org.apache.atlas.hive.hook.HiveHook.registerProcess(HiveHook.java:596)
> at org.apache.atlas.hive.hook.HiveHook.fireAndForget(HiveHook.java:222)
> at org.apache.atlas.hive.hook.HiveHook.access$200(HiveHook.java:77)
> at org.apache.atlas.hive.hook.HiveHook$2.run(HiveHook.java:182)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 2016-06-15 11:34:30,423 ERROR [Atlas Logger 0]: hook.HiveHook (HiveHook.java:run(184)) - Atlas hook failed due to error
> java.lang.NullPointerException
> at java.lang.StringBuilder.<init>(StringBuilder.java:109)
> at org.apache.atlas.hive.hook.HiveHook.getProcessQualifiedName(HiveHook.java:738)
> at org.apache.atlas.hive.hook.HiveHook.getProcessReferenceable(HiveHook.java:703)
> at org.apache.atlas.hive.hook.HiveHook.registerProcess(HiveHook.java:596)
> at org.apache.atlas.hive.hook.HiveHook.fireAndForget(HiveHook.java:222)
> at org.apache.atlas.hive.hook.HiveHook.access$200(HiveHook.java:77)
> at org.apache.atlas.hive.hook.HiveHook$2.run(HiveHook.java:182)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)