Hi Fokko

Spark already fires events for many other things. It does so for ML pipelines,
and it makes this information available for data frames.

We use S3 in this case; I just simplified the example. It is important to
know which process took which action. Only Spark knows this, and it does
supply this information on other occasions.

So I don't think your comment makes sense?

Cheers
Bolke

On Mon 15 Oct 2018 at 19:05, Driesprong, Fokko <fo...@driesprong.frl> wrote:

> Hi Bolke,
>
> I would argue that Spark is not the right level of abstraction for doing
> this. I would create a wrapper around the particular filesystem:
> http://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html
> That way you can write a wrapper around the LocalFileSystem if data will
> be written to local disk, or around DistributedFileSystem when it is
> written to HDFS; many object stores also implement this interface. My 2¢
>
> Cheers, Fokko
>
> On Mon 15 Oct 2018 at 18:58, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
>> Hi,
>>
>> Apologies upfront if this should have gone to user@ but it seems a
>> developer question so here goes.
>>
>> We are trying to improve a listener to track lineage across our platform.
>> This requires tracking where data comes from and where it goes to. E.g.
>>
>> sc.setLogLevel("INFO")
>> val data = sc.textFile("hdfs://migration/staffingsec/Mydata.gz")
>> data.saveAsTextFile("hdfs://datalab/user/xxx")
>>
>> In this case we would like to know that Spark picked up “Mydata.gz” and
>> wrote it to “xxx”. Of course more complex examples are possible.
>>
>> In the particular case above, Spark (2.3.2) does not seem to trigger
>> any events, or at least none that we know of, that give us the relevant
>> information.
>>
>> Is that a correct assessment? What can we do to get that information
>> without knowing the code upfront? Should we provide a patch?
>>
>> Thanks
>> Bolke
>>
>> Sent from my iPad
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
