[
https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170210#comment-15170210
]
Chris Nauroth commented on HADOOP-12666:
----------------------------------------
If people would find line-by-line commentary easier, we could import the patch
into ReviewBoard (reviews.apache.org) or open a GitHub pull request.
Reviewers, please comment if you'd like to go that way.
[~vishwajeet.dusane], thank you for updating the patch in response to feedback.
Here are a few more comments on patch v006.
# pom.xml should use hadoop-project as its parent. See
hadoop-tools/hadoop-azure/pom.xml for an example.
# artifactId should be hadoop-azure-datalake. The resulting built jar will
then be hadoop-azure-datalake-<VERSION>.jar. This naming will be consistent
with the other Hadoop build artifacts.
# Please indent pom.xml by 2 spaces.
# Thank you for the documentation!
# Would you also please update hadoop-project/src/site/site.xml, so that we get
a hyperlink in the left nav? Look for the "Hadoop Compatible File Systems"
section, and you can add a link similar to the one for WASB.
# Typo: "File relication factor"
# Typo: Azure Data Lake Storage is access path syntax is
# Typo: expermental
# There are several links to this URL:
https://azure.microsoft.com/en-in/services/data-lake-store/ . Perhaps it's
just my opinion, but the content there strays a little too much towards
[marketecture|https://en.wikipedia.org/wiki/Marchitecture]. The readers of our
documentation tend to be a technical audience that is already using Hadoop, so
they want to find usage information and technical details as quickly as
possible. What do you think about linking to this other page instead, which
seems to be the "developer hub"?
https://azure.microsoft.com/en-in/documentation/services/data-lake-store/
# I see widespread mishandling of {{InterruptedException}} throughout the
patch. The classic article on how to handle {{InterruptedException}} is here:
http://www.ibm.com/developerworks/library/j-jtp05236/ . The bottom line is
that it shouldn't be swallowed or rethrown as a different exception type unless
you first restore the thread's interrupted status by calling
{{Thread.currentThread().interrupt()}}. Failure to restore interrupted status
can cause unexpected behavior at other layers of the code that are expecting to
observe the interrupted status later, such as during shutdown handling. There
is a lot of existing Hadoop code that doesn't do this correctly, so reading
Hadoop code isn't always a good example. Let's not introduce more of the same
problems in new code though.
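To illustrate the pattern from that last point, here is a rough sketch (this is
not code from the patch; the method and variable names are just placeholders)
of translating an {{InterruptedException}} into an {{InterruptedIOException}}
only after restoring the interrupted status:
{code:java}
import java.io.InterruptedIOException;
import java.util.concurrent.CountDownLatch;

public class InterruptHandlingExample {
  // Illustrative only: wait on a latch, preserving the interrupted status.
  static void awaitResponse(CountDownLatch latch) throws InterruptedIOException {
    try {
      latch.await();
    } catch (InterruptedException e) {
      // Restore the interrupted status so code higher up the stack
      // (for example, shutdown handling) can still observe it.
      Thread.currentThread().interrupt();
      throw new InterruptedIOException("Interrupted while waiting for response");
    }
  }
}
{code}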
Now for some of the broader discussion points:
bq. We do have extended Contact test cases which integrated with back-end
service however those test are not pushed as part of this check in. I will
create separate JIRA for the Live mode test cases.
I would strongly prefer to have the contract tests in place from day one as
part of this patch. Is that possible?
We have seen multiple times that the contract test suites have been valuable
for identifying subtle bugs in file system semantics. It's a much more
effective way to find and fix these kinds of bugs compared to getting multiple
hours into a complex system integration test run with higher-level
applications, only to have it fail because of some file system edge case.
Additionally, it's vital to have an integration test suite that actually
connects to the back-end service instead of using mocks.
We have seen several times that mock-based testing has failed to expose bugs
that surface when integrating with the real service. Let's try to run the
contract tests against the real service, not the mock. The recent addition of
the contract tests in WASB in HADOOP-12535 is a good example of how to achieve
this.
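To make the ask concrete, a live contract test binding can be very small. The
following is only a sketch: {{AdlStorageContract}} is a hypothetical contract
class modeled on the WASB contract classes from HADOOP-12535, not something in
the current patch.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.contract.AbstractContractOpenTest;
import org.apache.hadoop.fs.contract.AbstractFSContract;

// Sketch of a live contract test binding. AdlStorageContract is hypothetical,
// modeled on the WASB contract classes added in HADOOP-12535.
public class TestAdlContractOpenLive extends AbstractContractOpenTest {
  @Override
  protected AbstractFSContract createContract(Configuration conf) {
    // Binds the generic open/read contract tests to the real ADL service,
    // using credentials supplied through the contract test options file.
    return new AdlStorageContract(conf);
  }
}
{code}
Similar subclasses of the other abstract contract tests (create, delete, mkdir,
rename, seek) would round out the suite.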
The tests you already have in the patch are still useful, so please keep them.
The good thing about the mock-based tests is that they'll continue to run on
Jenkins pre-commit when people post new patches.
bq. Reason behind hiding logging through was to switch quickly between Log and
System.out.println during debugging. Quickest way is change the code than
configuration file. We will migrate to use SLF4J but not part of this patch
release. is that fine?
Maybe I'm missing something, but I'm having a hard time justifying this. Log4J
makes it easy to reroute its logging to stdout via configuration using the
ConsoleAppender. Managing it through configuration instead of code removes the
latency of the compile/deploy cycle when someone decides to change logging
settings. Hadoop developers are familiar and comfortable with Log4J
configuration, so if they want to debug, their first instinct is going to be to
change Log4J configuration, not change code. Can we please switch to SLF4J and
drop the {{ADLLogger}} class now instead of deferring it to later?
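For comparison, the standard SLF4J pattern is a one-line declaration per class,
and routing debug output to stdout then becomes purely a Log4J configuration
change (for example, attaching a ConsoleAppender). A minimal sketch follows;
the class and method names are only illustrative:
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Minimal SLF4J usage sketch; the class and method names are illustrative.
public class AdlLoggingExample {
  private static final Logger LOG =
      LoggerFactory.getLogger(AdlLoggingExample.class);

  void readBlock(long offset, int length) {
    // Controlled entirely by logging configuration; no code change is needed
    // to route this to stdout.
    LOG.debug("Reading block at offset {} length {}", offset, length);
  }
}
{code}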
bq. ADLConfKeys#LOG_VERSION is to capture code instrumentation version. this
information is used only during debugging session.
Can you please elaborate on what {{LOG_VERSION}} is supposed to mean? Does
"code instrumentation version" mean that this is trying to capture the version
of the Hadoop software that is running, or is there some special significance
to the hard-coded "1.2.1" value? If it's meant to indicate the Hadoop software
version, then the Hadoop {{VersionInfo}} class I mentioned would be a good fit.
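If the intent is to record the running Hadoop version, a sketch along these
lines would avoid the hard-coded value:
{code:java}
import org.apache.hadoop.util.VersionInfo;

// Sketch: obtain the running Hadoop version instead of hard-coding "1.2.1".
public class VersionExample {
  public static void main(String[] args) {
    System.out.println("Hadoop version: " + VersionInfo.getVersion());
    System.out.println("Built from revision: " + VersionInfo.getRevision());
  }
}
{code}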
bq. How do Hadoop folks feel about this hadoop-tools/hadoop-azure-datalake code
declaring classes in the hadoop.hdfs.web package? I feel it needs cleanup.
Yes, I agree with [~fabbri]. Generally, it's a poor practice for multiple jars
to bundle classes in the same Java package. This can cause classpath
confusion. For example, if two jars accidentally introduce a {{Foo}} class in
the same Java package, then the JVM might not load the one you thought it would
load at runtime. There are places in the Hadoop codebase where we have had to
do this, but that was mostly for backwards-compatibility reasons, not a
desirable design choice.
bq. Also, to clarify my earlier comment, currently properties like
"dfs.webhdfs.oauth2.access.token.provider" are used by both WebHDFS and ADL.
This makes it impossible to have WebHDFS and ADL configured for the same
client, so use cases like distcp'ing from one file system to another suddenly
become unreasonable to support.
Yes, I agree with [~mackrorysd]. This is important, because using DistCp for
external backup to a cloud provider is a common use case. WebHDFS (backed by
HDFS) is often recommended for DistCp due to its cross-version compatibility
story, so it's important for both ADL and WebHDFS backed by HDFS to co-exist in
the same deployment.
bq. Assumption is FsInputSteam instance API would not be invoked from multiple
threads. Are you aware of such condition which we might encounter?
Off the top of my head, I can't point out a specific application that performs
multi-threaded reads on a single shared input stream, but I'm pretty sure this
happens somewhere. However, I know for sure that HDFS does provide
thread-safety for this usage pattern, as do WASB and S3. Failing to match
the semantics of HDFS here creates a potential bug. In general, I agree
with the comments from [~fabbri] about race conditions.
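To be clear about the usage pattern I mean, here is a rough sketch of two
threads issuing positioned reads against one shared stream, which HDFS, WASB,
and S3 all tolerate. The path and buffer sizes below are arbitrary:
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the usage pattern in question: two threads share one input stream
// and issue positioned reads concurrently. The path and sizes are arbitrary.
public class SharedStreamReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataInputStream in = fs.open(new Path("/tmp/sample.dat"))) {
      Runnable reader = () -> {
        byte[] buffer = new byte[4096];
        try {
          // Positioned read: does not move the stream's shared file pointer.
          in.read(0L, buffer, 0, buffer.length);
        } catch (IOException e) {
          e.printStackTrace();
        }
      };
      Thread t1 = new Thread(reader);
      Thread t2 = new Thread(reader);
      t1.start();
      t2.start();
      t1.join();
      t2.join();
    }
  }
}
{code}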
bq. We removed synchronization block for performance reason and used Volatile
for specific variables which require synchronization. Problem could be when
same FsInputSteam instance being used across threads to read from the same
file. Couple of variable does not require to be volatile though, updated code.
It sounds like you attempted coarser synchronization at one point but weren't
satisfied with the performance. The challenge will be to provide thread-safety
without compromising on performance.
bq. In order to achieve our common goal, I would have to file few more JIRA's
on the `org.apache.hadoop.hdfs.web` package and work on to make extended
FileSystem from `org.apache.hadoop.hdfs.web` configurable and refactor existing
ADL package accordingly. I would take up this activity once the Rev 1 i.e. this
patch set is pushed in to ASF.
Do you have a sense of how much work is required in
{{org.apache.hadoop.hdfs.web}} to set up this patch so that all of its classes
can reside in {{org.apache.hadoop.fs.adl}}? Maybe we could get that done now
as a set of prerequisite patches before going ahead with this one.
> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
> Key: HADOOP-12666
> URL: https://issues.apache.org/jira/browse/HADOOP-12666
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs, fs/azure, tools
> Reporter: Vishwajeet Dusane
> Assignee: Vishwajeet Dusane
> Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch,
> HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-006.patch,
> HADOOP-12666-1.patch
>
> Original Estimate: 336h
> Time Spent: 336h
> Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing
> Hadoop applications such as MR, Hive, HBase, etc., to use the ADL store as
> input or output.
>
> ADL is an ultra-high-capacity store, optimized for massive throughput, with
> rich management and security features. More details are available at
> https://azure.microsoft.com/en-us/services/data-lake-store/