[
https://issues.apache.org/jira/browse/ATLAS-164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802696#comment-14802696
]
Rémy SAISSY commented on ATLAS-164:
-----------------------------------
Hi Venkatesh,
thanks.
* DfsDataModel
I agree; at first I considered three classes: file, dir and symlink.
I reverted back to a 1:1 mapping because handling symlinks required two
different properties depending on whether the target was a file or a
directory. I thought it would not be an issue to map inodes, since the query
language makes it possible to list files, dirs and symlinks separately.
A question: can we model class inheritance? If so, I could have the dir, file
and symlink classes inherit from inode and provide a clean symlink_target
attribute typed as the parent class (see the sketch below).
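For illustration, a minimal sketch of what that could look like with the
0.6-era typesystem helpers (TypesUtil, AttributeDefinition), assuming
inheritance works the way the hive-bridge model uses it; all type and
attribute names here (inode, path, owner, symlink_target) are placeholders of
mine, not an agreed model:
{code:java}
import com.google.common.collect.ImmutableList;
import org.apache.atlas.typesystem.types.AttributeDefinition;
import org.apache.atlas.typesystem.types.ClassType;
import org.apache.atlas.typesystem.types.DataTypes;
import org.apache.atlas.typesystem.types.HierarchicalTypeDefinition;
import org.apache.atlas.typesystem.types.Multiplicity;
import org.apache.atlas.typesystem.types.utils.TypesUtil;

public class DfsDataModelSketch {
    public static void main(String[] args) {
        // Common parent class holding the attributes shared by all inode kinds.
        HierarchicalTypeDefinition<ClassType> inode = TypesUtil.createClassTypeDef(
                "inode", ImmutableList.<String>of(),
                new AttributeDefinition("path", DataTypes.STRING_TYPE.getName(),
                        Multiplicity.REQUIRED, false, null),
                new AttributeDefinition("owner", DataTypes.STRING_TYPE.getName(),
                        Multiplicity.OPTIONAL, false, null));

        // file and dir add nothing of their own here; they exist so that
        // queries can filter on the inode kind.
        HierarchicalTypeDefinition<ClassType> file =
                TypesUtil.createClassTypeDef("file", ImmutableList.of("inode"));
        HierarchicalTypeDefinition<ClassType> dir =
                TypesUtil.createClassTypeDef("dir", ImmutableList.of("inode"));

        // symlink_target is typed as the parent class, so a single property
        // covers both file and directory targets.
        HierarchicalTypeDefinition<ClassType> symlink = TypesUtil.createClassTypeDef(
                "symlink", ImmutableList.of("inode"),
                new AttributeDefinition("symlink_target", "inode",
                        Multiplicity.REQUIRED, false, null));
    }
}
{code}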
* Import
Thanks for the pointer, I will check how Falcon does it. Apart from the
technical standpoint, I will also read up a bit on regulatory requirements,
since modelling at the data set level reduces the granularity and might
therefore not be precise enough for some regulatory needs.
Also, I see two approaches to data sets:
- one that requires manually defining data sets in the webapp, so the bridge
logs only those data sets (and ignores the other events on HDFS)
- one that considers a data set to be a non-recursive directory, so any action
on a file logs an event against its parent directory
The latter has the advantage of covering all actions in HDFS and being easier
to configure and use for the end user, so I would prefer it (see the sketch
below).
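To make the second approach concrete, a rough sketch of how a bridge could
fold file-level events onto their parent directory, assuming it tails the HDFS
inotify stream (HdfsAdmin.getInotifyEventStream(), Hadoop 2.7+ API); the
namenode URI and the logForDataset() helper are hypothetical:
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
import org.apache.hadoop.hdfs.client.HdfsAdmin;
import org.apache.hadoop.hdfs.inotify.Event;
import org.apache.hadoop.hdfs.inotify.EventBatch;

public class DirDatasetSketch {
    public static void main(String[] args) throws Exception {
        // Assumption: the bridge reads the NameNode inotify event stream.
        HdfsAdmin admin = new HdfsAdmin(
                URI.create("hdfs://namenode:8020"), new Configuration());
        DFSInotifyEventInputStream stream = admin.getInotifyEventStream();

        while (true) {
            EventBatch batch = stream.take();
            for (Event event : batch.getEvents()) {
                switch (event.getEventType()) {
                    case CREATE:
                        logForDataset(((Event.CreateEvent) event).getPath());
                        break;
                    case CLOSE:
                        logForDataset(((Event.CloseEvent) event).getPath());
                        break;
                    case UNLINK:
                        logForDataset(((Event.UnlinkEvent) event).getPath());
                        break;
                    default:
                        break;
                }
            }
        }
    }

    // The data set is the file's (non-recursive) parent directory, so many
    // file-level events collapse onto one data set.
    static void logForDataset(String filePath) {
        Path dataset = new Path(filePath).getParent();
        System.out.println("event for data set: " + dataset);
    }
}
{code}
The end user would not have to declare anything: every directory that receives
file activity becomes a data set automatically.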
* Lineage
This is because I haven't yet fully understood how lineage should be handled
by Atlas addons.
- should I also keep track of who executed what action on a data set / file /
dir / symlink? I haven't seen support for it in the hive-bridge, but I guess it
is required to comply with regulatory needs.
Speaking about the set of files consumed by a Pig, MR, Spark or other job,
since HDFS sees actions as they happen, I see two approaches:
- HDFS level: consider a data set to be a non-recursive directory. That would
mean a lot of events, but all targeting the same node in Atlas (the source /
target directory of the job)
- processing framework level: hook an addon into each framework that logs
events into Atlas about the same data as the HDFS bridge does.
--> I prefer doing it at the HDFS level only; it is more generic (see the
sketch below).
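For the HDFS-level option, a hedged sketch of what registering the directory
node could look like, mimicking the hive-bridge style Referenceable /
AtlasClient usage from 0.6 (the exact client call may differ); the dir type
comes from the model sketch above and the URL and attribute values are made up:
{code:java}
import org.apache.atlas.AtlasClient;
import org.apache.atlas.typesystem.Referenceable;
import org.apache.atlas.typesystem.json.InstanceSerialization;

public class DirLineageSketch {
    public static void main(String[] args) throws Exception {
        // Assumption: Atlas listens on its default port; adjust as needed.
        AtlasClient client = new AtlasClient("http://localhost:21000");

        // One Atlas node per directory: every file-level event under
        // /user/remy/output maps back onto this same entity, so a job's many
        // writes collapse onto one data set node.
        Referenceable dir = new Referenceable("dir");
        dir.set("path", "/user/remy/output");
        dir.set("owner", "remy");

        client.createEntity(InstanceSerialization.toJson(dir, true));
    }
}
{code}
The who-did-what question would then become an extra attribute, or a separate
event entity, hanging off that directory node.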
* Unit Tests
I made a typo: I meant the integration test.
> DFS addon for Atlas
> -------------------
>
> Key: ATLAS-164
> URL: https://issues.apache.org/jira/browse/ATLAS-164
> Project: Atlas
> Issue Type: New Feature
> Affects Versions: 0.6-incubating
> Reporter: Rémy SAISSY
> Assignee: Rémy SAISSY
> Attachments: ATLAS-164.15092015.patch, ATLAS-164.15092015.patch
>
>
> Hi,
> I have written an addon for sending DFS metadata into Atlas.
> The patch is attached.
> However, I have a hard time getting the unit tests working properly, so some
> advice would be welcome.
> Thanks.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)