[
https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170210#comment-15170210
]
Chris Nauroth commented on HADOOP-12666:
----------------------------------------
If people would find line-by-line commentary easier, we could import the patch
into ReviewBoard (reviews.apache.org) or open a GitHub pull request.
Reviewers, please comment if you'd like to go that way.
[~vishwajeet.dusane], thank you for updating the patch in response to feedback.
Here are a few more comments on patch v006.
# pom.xml should use hadoop-project as its parent. See
hadoop-tools/hadoop-azure/pom.xml for an example.
# artifactId should be hadoop-azure-datalake. The resulting built jar will
then be hadoop-azure-datalake-<VERSION>.jar. This naming will be consistent
with the other Hadoop build artifacts.
# Please indent pom.xml by 2 spaces.
# Thank you for the documentation!
# Would you also please update hadoop-project/src/site/site.xml, so that we get
a hyperlink in the left nav? Look for the "Hadoop Compatible File Systems"
section, and you can add a link similar to the one for WASB.
# Typo: "File relication factor"
# Typo: Azure Data Lake Storage is access path syntax is
# Typo: expermental
# There are several links to this URL:
https://azure.microsoft.com/en-in/services/data-lake-store/ . Perhaps it's
just my opinion, but the content there strays a little too much towards
[marketecture|https://en.wikipedia.org/wiki/Marchitecture]. The readers of our
documentation tend to be a technical audience that is already using Hadoop, so
they want to find usage information and technical details as quickly as
possible. What do you think about linking to this other page instead, which
seems to be the "developer hub"?
https://azure.microsoft.com/en-in/documentation/services/data-lake-store/
# I see widespread mishandling of {{InterruptedException}} throughout the
patch. The classic article on how to handle {{InterruptedException}} is here:
http://www.ibm.com/developerworks/library/j-jtp05236/ . The bottom line is
that it shouldn't be swallowed or rethrown as a different exception type unless
you first restore the thread's interrupted status by calling
{{Thread.currentThread().interrupt()}}. Failure to restore interrupted status
can cause unexpected behavior at other layers of the code that are expecting to
observe the interrupted status later, such as during shutdown handling. There
is a lot of existing Hadoop code that doesn't do this correctly, so reading
Hadoop code isn't always a good example. Let's not introduce more of the same
problems in new code though.
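To illustrate the pattern from that last point, here is a rough sketch (this is
not code from the patch; the method and variable names are just placeholders)
of translating an {{InterruptedException}} into an {{InterruptedIOException}}
only after restoring the interrupted status:
{code:java}
import java.io.InterruptedIOException;
import java.util.concurrent.CountDownLatch;

public class InterruptHandlingExample {
  // Illustrative only: wait on a latch, preserving the interrupted status.
  static void awaitResponse(CountDownLatch latch) throws InterruptedIOException {
    try {
      latch.await();
    } catch (InterruptedException e) {
      // Restore the interrupted status so code higher up the stack
      // (for example, shutdown handling) can still observe it.
      Thread.currentThread().interrupt();
      throw new InterruptedIOException("Interrupted while waiting for response");
    }
  }
}
{code}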
Now for some of the broader discussion points:
bq. We do have extended Contact test cases which integrated with back-end
service however those test are not pushed as part of this check in. I will
create separate JIRA for the Live mode test cases.
I would strongly prefer to have the contract tests in place from day one as
part of this patch. Is that possible?
We have seen multiple times that the contract test suites have been valuable
for identifying subtle bugs in file system semantics. It's a much more
effective way to find and fix these kinds of bugs compared to getting multiple
hours into a complex system integration test run with higher-level
applications, only to have it fail because of some file system edge case.
Additionally, it's vital to have an integration test suite that actually
connects to the back-end service instead of using mocks.
We have seen several times that mock-based testing has failed to expose bugs
that surface when integrating with the real service. Let's try to run the
contract tests against the real service, not the mock. The recent addition of
the contract tests in WASB in HADOOP-12535 is a good example of how to achieve
this.
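To make the ask concrete, a live contract test binding can be very small. The
following is only a sketch: {{AdlStorageContract}} is a hypothetical contract
class modeled on the WASB contract classes from HADOOP-12535, not something in
the current patch.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.contract.AbstractContractOpenTest;
import org.apache.hadoop.fs.contract.AbstractFSContract;

// Sketch of a live contract test binding. AdlStorageContract is hypothetical,
// modeled on the WASB contract classes added in HADOOP-12535.
public class TestAdlContractOpenLive extends AbstractContractOpenTest {
  @Override
  protected AbstractFSContract createContract(Configuration conf) {
    // Binds the generic open/read contract tests to the real ADL service,
    // using credentials supplied through the contract test options file.
    return new AdlStorageContract(conf);
  }
}
{code}
Similar subclasses of the other abstract contract tests (create, delete, mkdir,
rename, seek) would round out the suite.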
The tests you already have in the patch are still useful, so please keep them.
The good thing about the mock-based tests is that they'll continue to run on
Jenkins pre-commit when people post new patches.
bq. Reason behind hiding logging through was to switch quickly between Log and
System.out.println during debugging. Quickest way is change the code than
configuration file. We will migrate to use SLF4J but not part of this patch
release. is that fine?
Maybe I'm missing something, but I'm having a hard time justifying this. Log4J
makes it easy to reroute its logging to stdout via configuration using the
ConsoleAppender. Managing it through configuration instead of code removes the
latency of the compile/deploy cycle when someone decides to change logging
settings. Hadoop developers are familiar and comfortable with Log4J
configuration, so if they want to debug, their first instinct is going to be to
change Log4J configuration, not change code. Can we please switch to SLF4J and
drop the {{ADLLogger}} class now instead of deferring it to later?
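For comparison, the standard SLF4J pattern is a one-line declaration per class,
and routing debug output to stdout then becomes purely a Log4J configuration
change (for example, attaching a ConsoleAppender). A minimal sketch follows;
the class and method names are only illustrative:
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Minimal SLF4J usage sketch; the class and method names are illustrative.
public class AdlLoggingExample {
  private static final Logger LOG =
      LoggerFactory.getLogger(AdlLoggingExample.class);

  void readBlock(long offset, int length) {
    // Controlled entirely by logging configuration; no code change is needed
    // to route this to stdout.
    LOG.debug("Reading block at offset {} length {}", offset, length);
  }
}
{code}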
bq. ADLConfKeys#LOG_VERSION is to capture code instrumentation version. this
information is used only during debugging session.
Can you please elaborate on what {{LOG_VERSION}} is supposed to mean? Does
"code instrumentation version" mean that this is trying to capture the version
of the Hadoop software that is running, or is there some special significance
to the hard-coded "1.2.1" value? If it's meant to indicate the Hadoop software
version, then the Hadoop {{VersionInfo}} class I mentioned would be a good fit.
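If the intent is to record the running Hadoop version, a sketch along these
lines would avoid the hard-coded value:
{code:java}
import org.apache.hadoop.util.VersionInfo;

// Sketch: obtain the running Hadoop version instead of hard-coding "1.2.1".
public class VersionExample {
  public static void main(String[] args) {
    System.out.println("Hadoop version: " + VersionInfo.getVersion());
    System.out.println("Built from revision: " + VersionInfo.getRevision());
  }
}
{code}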
bq. How do Hadoop folks feel about this hadoop-tools/hadoop-azure-datalake code
declaring classes in the hadoop.hdfs.web package? I feel it needs cleanup.
Yes, I agree with [~fabbri]. Generally, it's a poor practice for multiple jars
to bundle classes in the same Java package. This can cause classpath
confusion. For example, if two jars accidentally introduce a {{Foo}} class in
the same Java package, then the JVM might not load the one you thought it would
load at runtime. There are places in the Hadoop codebase where we have had to
do this, but that was mostly for backwards-compatibility reasons, not a
desirable design choice.
bq. Also, to clarify my earlier comment, currently properties like
"dfs.webhdfs.oauth2.access.token.provider" are used by both WebHDFS and ADL.
This makes it impossible to have WebHDFS and ADL configured for the same
client, so use cases like distcp'ing from one file system to another suddenly
become unreasonable to support.
Yes, I agree with [~mackrorysd]. This is important, because using DistCp for
external backup to a cloud provider is a common use case. WebHDFS (backed by
HDFS) is often recommended for DistCp due to its cross-version compatibility
story, so it's important for both ADL and WebHDFS backed by HDFS to co-exist in
the same deployment.
bq. Assumption is FsInputSteam instance API would not be invoked from multiple
threads. Are you aware of such condition which we might encounter?
Off the top of my head, I can't point out a specific application that performs
multi-threaded reads on a single shared input stream, but I'm pretty sure this
happens somewhere. However, I know for sure that HDFS does provide
thread-safety for this usage pattern, as do WASB and S3. Failing to match
the semantics of HDFS here creates a potential bug. In general, I agree
with the comments from [~fabbri] about race conditions.
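To be clear about the usage pattern I mean, here is a rough sketch of two
threads issuing positioned reads against one shared stream, which HDFS, WASB,
and S3 all tolerate. The path and buffer sizes below are arbitrary:
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the usage pattern in question: two threads share one input stream
// and issue positioned reads concurrently. The path and sizes are arbitrary.
public class SharedStreamReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataInputStream in = fs.open(new Path("/tmp/sample.dat"))) {
      Runnable reader = () -> {
        byte[] buffer = new byte[4096];
        try {
          // Positioned read: does not move the stream's shared file pointer.
          in.read(0L, buffer, 0, buffer.length);
        } catch (IOException e) {
          e.printStackTrace();
        }
      };
      Thread t1 = new Thread(reader);
      Thread t2 = new Thread(reader);
      t1.start();
      t2.start();
      t1.join();
      t2.join();
    }
  }
}
{code}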
bq. We removed synchronization block for performance reason and used Volatile
for specific variables which require synchronization. Problem could be when
same FsInputSteam instance being used across threads to read from the same
file. Couple of variable does not require to be volatile though, updated code.
It sounds like you attempted coarser synchronization at one point but weren't
satisfied with the performance. The challenge will be to provide thread-safety
without compromising on performance.
bq. In order to achieve our common goal, I would have to file few more JIRA's
on the `org.apache.hadoop.hdfs.web` package and work on to make extended
FileSystem from `org.apache.hadoop.hdfs.web` configurable and refactor existing
ADL package accordingly. I would take up this activity once the Rev 1 i.e. this
patch set is pushed in to ASF.
Do you have a sense of how much work is required in
{{org.apache.hadoop.hdfs.web}} to set up this patch so that all of its classes
can reside in {{org.apache.hadoop.fs.adl}}? Maybe we could get that done now
as a set of prerequisite patches before going ahead with this one.
> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
> Key: HADOOP-12666
> URL: https://issues.apache.org/jira/browse/HADOOP-12666
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs, fs/azure, tools
> Reporter: Vishwajeet Dusane
> Assignee: Vishwajeet Dusane
> Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch,
> HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-006.patch,
> HADOOP-12666-1.patch
>
> Original Estimate: 336h
> Time Spent: 336h
> Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing
> Hadoop applications such as MR, Hive, HBase, etc., to use the ADL store as
> input or output.
>
> ADL is an ultra-high-capacity store, optimized for massive throughput, with
> rich management and security features. More details are available at
> https://azure.microsoft.com/en-us/services/data-lake-store/