[
https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822359#comment-16822359
]
Ayush Saxena commented on HDFS-14440:
-------------------------------------
Thanks [~elgoiri] for the opinions.
This isn't an issue that fixes or changes any functionality, so I am
absolutely OK if you think it isn't worth having.
On reflection, this just tends to save some time in the file write
process. I am not sure whether a WRITE operation is that critical, but
READ and WRITE are elementary operations for every FileSystem; everything
revolves around them.
After the change I didn't find any case where the time consumed was, in
general, more than the time consumed earlier. Yes, there would be an extra RPC
in the failure case where the last block was present; if that's not present we
go for additional getFileInfo calls too. Even in that case we should still
take less time than before.
Now consider normal writes, i.e. where the file write succeeds. In practical
deployments most of our writes succeed; the number of successful writes is
higher than the number failing with File Already Exists. In this path we have
to check all the namespaces for the existence of the file. This is an
operation we don't do when we talk to the namenode directly or when there is a
single destination; it is an overhead of our multi-destination framework, and
most importantly it is a required operation that we can't skip. As of now the
time taken is proportional to the number of namespaces, i.e. N time units.
This is what we optimized to 1 time unit, independent of the number of
namespaces. The scalability we achieve comes from spawning a new namespace and
adding it as an extra destination; I just wanted to make sure that doing so
incurs little or no cost for the elementary process. The difference won't be
very noticeable if the network and processing are very fast, but in average
cases it shows up.
I may be looking at this from too far a distance; let me know if it interests
you. :)
> RBF: Optimize the file write process in case of multiple destinations.
> ----------------------------------------------------------------------
>
> Key: HDFS-14440
> URL: https://issues.apache.org/jira/browse/HDFS-14440
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Ayush Saxena
> Assignee: Ayush Saxena
> Priority: Major
> Attachments: HDFS-14440-HDFS-13891-01.patch
>
>
> In case of multiple destinations, we need to check whether the file already
> exists in one of the subclusters, for which we use the existing
> getBlockLocation() API, which is by default a sequential call.
> In the ideal scenario where the file needs to be created, each subcluster is
> checked sequentially; this can be done concurrently to save time.
> In the other case, where the file is found and the last block is null, we
> need to call getFileInfo on all the locations to find the location where the
> file exists. This can also be avoided by using ConcurrentCall, since we
> would then have the remoteLocation for which getBlockLocation returned a
> non-null entry.
>