[ https://issues.apache.org/jira/browse/ATLAS-164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802696#comment-14802696 ]

Rémy SAISSY commented on ATLAS-164:
-----------------------------------

Hi Venkatesh,
thanks.

* DfsDataModel 
I agree, at first I started by considering three classes: file, dir and symlink.
I reverted to a 1:1 mapping because handling symlinks required two different 
properties depending on whether the target was a file or a directory. I thought 
it would not be an issue to map inodes directly since the query language makes 
it possible to show files, dirs and symlinks separately.

A question: can we model class inheritance? If so, I could have the dir, file 
and symlink classes inherit from inode and provide a clean symlink_target 
attribute typed with the parent class.
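
To make the idea concrete, here is a rough sketch of the hierarchy I have in 
mind, assuming super types are supported and reusing the kind of helpers the 
hive-bridge data model generator uses. The class names, attributes and exact 
helper signatures below are only illustrative, not something already in the patch:

import com.google.common.collect.ImmutableList;
import org.apache.atlas.typesystem.types.AttributeDefinition;
import org.apache.atlas.typesystem.types.ClassType;
import org.apache.atlas.typesystem.types.DataTypes;
import org.apache.atlas.typesystem.types.HierarchicalTypeDefinition;
import org.apache.atlas.typesystem.types.Multiplicity;
import org.apache.atlas.typesystem.types.utils.TypesUtil;

public class DfsDataModelSketch {

    // Parent class carrying the attributes common to every inode.
    static final HierarchicalTypeDefinition<ClassType> INODE =
        TypesUtil.createClassTypeDef("dfs_inode", ImmutableList.<String>of(),
            TypesUtil.createRequiredAttrDef("path", DataTypes.STRING_TYPE),
            TypesUtil.createRequiredAttrDef("owner", DataTypes.STRING_TYPE));

    // file and dir only inherit the common inode attributes.
    static final HierarchicalTypeDefinition<ClassType> FILE =
        TypesUtil.createClassTypeDef("dfs_file", ImmutableList.of("dfs_inode"));

    static final HierarchicalTypeDefinition<ClassType> DIR =
        TypesUtil.createClassTypeDef("dfs_dir", ImmutableList.of("dfs_inode"));

    // symlink_target is typed with the parent class, so the same attribute
    // works whether the target is a file or a directory.
    static final HierarchicalTypeDefinition<ClassType> SYMLINK =
        TypesUtil.createClassTypeDef("dfs_symlink", ImmutableList.of("dfs_inode"),
            new AttributeDefinition("symlink_target", "dfs_inode",
                Multiplicity.REQUIRED, false, null));
}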

* Import

Thanks for the pointer, I will check how Falcon does it. Apart from the 
technical standpoint, I will also read up a bit on regulatory needs, since 
implementing this as data sets reduces the granularity and might therefore not 
be precise enough for some regulatory requirements.
Also, I see two approaches to data sets:
 - one that requires data sets to be defined manually in the webapp, so the 
bridge logs only those data sets (and ignores the other events on HDFS)
 - one that considers a data set to be a non-recursive directory; any action 
on a file logs an event for its directory

The latter has the advantage of processing all actions in HDFS and being easier 
to configure and use for the end user, so I would prefer it.
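
To illustrate the second approach with a minimal sketch (the paths and class 
below are made up for the example, not part of the patch): every event on a 
file is attributed to its immediate parent directory, which is the 
non-recursive data set entity.

import org.apache.hadoop.fs.Path;

public class DirectoryDataSetSketch {

    // Maps a path touched by an HDFS event to the non-recursive directory
    // that acts as the data set: directories map to themselves, files map
    // to their immediate parent.
    static Path toDataSet(Path eventPath, boolean isDirectory) {
        return isDirectory ? eventPath : eventPath.getParent();
    }

    public static void main(String[] args) {
        // A write to a part file is recorded against its directory...
        System.out.println(toDataSet(new Path("/data/sales/2015/09/part-00000"), false));
        // ...and an operation on the directory itself maps to the same data set.
        System.out.println(toDataSet(new Path("/data/sales/2015/09"), true));
        // Both print /data/sales/2015/09
    }
}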

* Lineage

This is because I haven't yet fully understood how lineage should be handled by 
Atlas addons.
 - should I also keep track of who executed what action on a data set / file / 
dir / symlink? I haven't seen support for it in the hive-bridge, but I guess it 
is required to comply with regulatory needs.

Speaking about the set of files consumed by a Pig, MR, Spark or whatever job, 
since HDFS sees actions as they happen, I see two approaches:
 - HDFS level: consider a data set to be a non-recursive directory. That would 
generate a lot of events, but all for the same node in Atlas (the source / 
target directory of the job)
 - processing framework level: hook an addon into each framework that logs 
events into Atlas on the same data as the hdfs bridge ones.

--> I prefer doing it at the HDFS level only. It is more generic.
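
As a small illustration of why the HDFS-level approach keeps the graph small 
(again just a sketch, with made-up paths): a job touching many part files still 
resolves to only two directory nodes.

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.fs.Path;

public class HdfsLevelLineageSketch {

    public static void main(String[] args) {
        // Paths touched by a Pig/MR/Spark job, as seen from the HDFS side.
        List<Path> touched = Arrays.asList(
            new Path("/warehouse/input/part-00000"),
            new Path("/warehouse/input/part-00001"),
            new Path("/warehouse/output/part-00000"),
            new Path("/warehouse/output/part-00001"));

        // Every file event collapses onto its parent directory, so the many
        // events all land on the same few data set nodes in Atlas.
        Set<Path> dataSetNodes = new LinkedHashSet<>();
        for (Path p : touched) {
            dataSetNodes.add(p.getParent());
        }
        System.out.println(dataSetNodes);
        // Prints [/warehouse/input, /warehouse/output]
    }
}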

* Unit Tests

I made a typo: I meant the integration test.


> DFS addon for Atlas
> -------------------
>
>                 Key: ATLAS-164
>                 URL: https://issues.apache.org/jira/browse/ATLAS-164
>             Project: Atlas
>          Issue Type: New Feature
>    Affects Versions: 0.6-incubating
>            Reporter: Rémy SAISSY
>            Assignee: Rémy SAISSY
>         Attachments: ATLAS-164.15092015.patch, ATLAS-164.15092015.patch
>
>
> Hi,
> I have written an addon for sending DFS metadata into Atlas.
> The patch is attached.
> However, I am having a hard time getting the unit tests to work properly, so 
> some advice would be welcome.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
