[
https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884230#comment-15884230
]
Steve Loughran edited comment on SPARK-19715 at 2/25/17 1:24 PM:
-----------------------------------------------------------------
OK. I'd recommend going with {{Path.toUri().getPath()}} to get the full path,
though there's always the risk of more than one s3a bucket referring to the
same objects.
Some filesystems (HDFS) have checksums you can ask for, though S3a doesn't,
yet: HADOOP-13282 has discussed serving up etags, primarily to aid distcp
updates. If added, you could use that as the differentiator, or at least to
identify changed files. Patches welcome, [with
tests|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md]
To be ruthless, it may have been simpler for the user just to edit the
fs.s3n.impl binding to point to S3AFileSystem.class and then leave the URLs the
same.
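As a rough illustration of the path-only comparison suggested above: Hadoop's {{Path.toUri()}} returns a {{java.net.URI}}, so the sketch below uses {{java.net.URI}} directly to stay free of Hadoop dependencies. The class name {{PathKeys}}, the helper {{pathOnly}}, and the bucket names are hypothetical, chosen only for this example.

```java
import java.net.URI;

class PathKeys {
    // Returns only the path component of a location string,
    // dropping the scheme (s3n, s3a, ...) and the authority (the bucket).
    static String pathOnly(String location) {
        return URI.create(location).getPath();
    }

    public static void main(String[] args) {
        String viaS3n = pathOnly("s3n://example-bucket/logs/2017/02/25/part-0000.json");
        String viaS3a = pathOnly("s3a://example-bucket/logs/2017/02/25/part-0000.json");
        String other  = pathOnly("s3a://other-bucket/logs/2017/02/25/part-0000.json");

        System.out.println(viaS3n);                // /logs/2017/02/25/part-0000.json
        System.out.println(viaS3n.equals(viaS3a)); // true: the scheme change no longer matters
        System.out.println(viaS3a.equals(other));  // true: different buckets collide on the same key
    }
}
```

The last line shows the risk mentioned above: once the bucket is stripped, objects in different buckets with the same path become indistinguishable.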
> Option to Strip Paths in FileSource
> -----------------------------------
>
> Key: SPARK-19715
> URL: https://issues.apache.org/jira/browse/SPARK-19715
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Affects Versions: 2.1.0
> Reporter: Michael Armbrust
>
> Today, we compare the whole path when deciding if a file is new in the
> FileSource for structured streaming. However, this can cause false
> negatives in the case where the path has changed in a cosmetic way (e.g.
> changing s3n to s3a). We should add an option {{fileNameOnly}} that causes
> the new file check to be based only on the filename (but still store the
> whole path in the log).
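A minimal sketch of what such an option could look like, assuming a simple in-memory set of seen keys. This is not Spark's FileSource implementation; the class {{SeenFiles}} and its methods are hypothetical names used only to illustrate the idea of checking on the filename while still recording the whole path.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SeenFiles {
    private final boolean fileNameOnly;                   // hypothetical stand-in for the proposed option
    private final Set<String> seenKeys = new HashSet<>(); // keys used for the new-file check
    private final List<String> log = new ArrayList<>();   // whole paths, as the proposal asks

    SeenFiles(boolean fileNameOnly) {
        this.fileNameOnly = fileNameOnly;
    }

    // Key used to decide whether a file is new: the whole location string,
    // or only the last path segment when fileNameOnly is set.
    String keyOf(String location) {
        if (!fileNameOnly) {
            return location;
        }
        String path = URI.create(location).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Returns true and records the whole path if the file has not been seen before.
    boolean markIfNew(String location) {
        boolean isNew = seenKeys.add(keyOf(location));
        if (isNew) {
            log.add(location);
        }
        return isNew;
    }

    List<String> loggedPaths() {
        return log;
    }
}
```

With {{new SeenFiles(false)}}, the same object reached through s3n and then s3a is reported as new twice; with {{new SeenFiles(true)}}, the second sighting is recognised as already seen. The trade-off is that two different files sharing a name in different directories would also be treated as the same file.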