Hi Ayush,

Thank you for the summary and for re-launching this discussion. The JIRA
links are very helpful for pushing it forward.
I would like to underline that this is a serious issue, and I very much
hope that we can reach a solution here. As far as I know, many users who
deploy RBF have hit this issue (as reported in HDFS-15079 and HDFS-15310),
and others may be affected without having detected it.

I would like to give my +1 to improving the RPC framework. IMO, it is the
more common and general solution (HADOOP-16254 is one attempt). Of course,
we should address the security concerns that Daryn raised.
In our practice, we enhanced the RPC framework and exposed some proxy
interfaces for the Router; both the Data Locality and the Data
Inconsistency problems were resolved as expected.
Any suggestions and feedback are welcome.
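
To make the Retry Cache problem concrete, here is a minimal sketch in Java.
It is a toy model, not the actual Namenode code: the names ToyRetryCache,
CallKey, invoke and executions are hypothetical, as are the client IDs and
call IDs used in main. It assumes only that the cache recognizes a repeated
call by the (client ID, call ID) pair the server sees, as described in the
quoted message below.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class ToyRetryCache {
    /** Hypothetical cache key: the (clientId, callId) pair seen by the server. */
    static final class CallKey {
        final String clientId;
        final int callId;

        CallKey(String clientId, int callId) {
            this.clientId = clientId;
            this.callId = callId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CallKey)) {
                return false;
            }
            CallKey other = (CallKey) o;
            return callId == other.callId && clientId.equals(other.clientId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(clientId, callId);
        }
    }

    private final Map<CallKey, String> completed = new HashMap<>();
    private int executions = 0;

    /** Runs the operation once per key; a repeated key gets the cached result. */
    public String invoke(String clientId, int callId, String op) {
        CallKey key = new CallKey(clientId, callId);
        String cached = completed.get(key);
        if (cached != null) {
            return cached; // recognized as a retry, not executed again
        }
        executions++;
        String result = op + "#" + executions;
        completed.put(key, result);
        return result;
    }

    public int executions() {
        return executions;
    }

    public static void main(String[] args) {
        // Direct connection: the retry carries the same client ID and call ID.
        ToyRetryCache direct = new ToyRetryCache();
        direct.invoke("client-A", 7, "mkdir");
        direct.invoke("client-A", 7, "mkdir");
        System.out.println("direct executions: " + direct.executions()); // 1

        // Through Routers: each Router presents its own IDs, so the retry
        // of the same client call looks like a brand-new call.
        ToyRetryCache viaRouters = new ToyRetryCache();
        viaRouters.invoke("router-1", 42, "mkdir");
        viaRouters.invoke("router-2", 13, "mkdir");
        System.out.println("via routers executions: " + viaRouters.executions()); // 2
    }
}
```

In this toy model the fix is simply to key on the original client's IDs; how
to carry them through the Router safely is exactly what the JIRAs discuss.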

Thanks and best regards.
Hexiaoqiao


On Mon, May 4, 2020 at 5:26 PM Ayush Saxena <ayush...@gmail.com> wrote:

> Hi All,
> I wanted to share and discuss a problem that we are currently facing
> when using Router Based Federation. At present, when a client connects
> through a Router to a Namenode, the Namenode receives the caller context
> of the Router rather than that of the actual client. This can cause a
> couple of problems, two of which we have identified so far:
>
> Firstly, the concept of data locality doesn't work correctly when
> connecting through a Router, as the Namenode considers the Router to be
> the actual client and performs all the optimizations/computations based on
> the Router's location rather than the actual client's location.
>
> Secondly, the Namenode Retry Cache cannot be used. In case of a failover
> or similar event, the client retries and may connect to another Router.
> Since the Call Id then comes from the Router and not from the actual
> client, the Retry Cache doesn't identify it as a repeated call and serves
> it as a whole new call, which creates inconsistencies.
>
> We have been discussing and trying out solutions for a long time now and
> have tried a couple of them:
>
>    - Add proxy address in IPC connection (HADOOP-16254
>    <https://issues.apache.org/jira/browse/HADOOP-16254>) --> This had some
>    security concerns for Daryn.
>    - The RouterRPCServer should transfer CallerContext and client ip to
>    NamenodeRpcServer (HDFS-13293
>    <https://issues.apache.org/jira/browse/HDFS-13293>) --> This tends to be
>    a little opaque, and a couple more problems are stated in HDFS-13248
>    <https://issues.apache.org/jira/browse/HDFS-13248> by Ajay Kumar and
>    Arpit Agarwal.
>    - Favored Nodes --> Pass the local node as favored node. But this isn't
>    a complete solution. It doesn't take into account the fallback in case
>    of non-availability of local nodes, and a couple more issues. It isn't
>    a solution for the Retry Cache problem either.
>
>
> The related JIRAs where most of the discussion happened, if someone wants
> to follow:
> HDFS-13248 <https://issues.apache.org/jira/browse/HDFS-13248> :- For the
> DataLocality Problem. It also has a patch at the end with Solution 3
> (Favored Nodes)
> HDFS-15079 <https://issues.apache.org/jira/browse/HDFS-15079> , HDFS-15078
> <https://issues.apache.org/jira/browse/HDFS-15078> & HDFS-15310
> <https://issues.apache.org/jira/browse/HDFS-15310>  : For the Retry Cache
> Problem.
> HADOOP-16254 <https://issues.apache.org/jira/browse/HADOOP-16254> :
> Solution 1 : Add proxy address in IPC connection.
> HDFS-13293 <https://issues.apache.org/jira/browse/HDFS-13293> : Solution 2
> : Passing Caller Context.
>
> Do let us know if you can help here: any further solutions, workarounds,
> or a way to unblock or improve the tried solutions.
>
> Thanx!!!
> -Ayush
>
