[jira] [Updated] (HADOOP-12666) Support Microsoft Azure Data Lake - as a file system in Hadoop

Chris Douglas (JIRA) Fri, 13 May 2016 18:34:55 -0700

     [ 
https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chris Douglas updated HADOOP-12666:
-----------------------------------
    Attachment: HADOOP-12666-012.patch

CachedRefreshTokenBasedAccessTokenProvider
- Since the AccessTokenProvider is only created by reflection, the Timer 
constructor is for testing and does not require an override in this subclass.
- The static instance should be final and created during class initialization, 
but...
- {{ConfRefreshTokenBasedAccessTokenProvider}} is not threadsafe. {{setConf}} 
updates the static instance without synchronization, and that instance is 
shared by every instance of {{CachedRTBATP}}. This could cause undefined 
behavior. Is the intent to pool clients with the same parameters? Would it 
make sense to add a small cache (v12)?
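
As a rough illustration of the "small cache" idea (not the code in the v12 patch), a bounded, synchronized map keyed by the client parameters could look like the sketch below. {{ProviderCache}}, its key type, and the factory function are hypothetical names introduced here; it assumes Java 8 and that a provider is fully determined by its key.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical bounded LRU cache; K stands in for the client parameters. */
final class ProviderCache<K, V> {
  private final Map<K, V> entries;
  private final Function<K, V> factory;

  ProviderCache(final int maxEntries, Function<K, V> factory) {
    this.factory = factory;
    // accessOrder=true keeps the least recently used entry first.
    this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
      }
    };
  }

  /** Returns the cached value for key, creating it if it is not resident. */
  synchronized V get(K key) {
    V value = entries.get(key);
    if (value == null) {
      value = factory.apply(key);
      entries.put(key, value);
    }
    return value;
  }

  synchronized int size() {
    return entries.size();
  }
}
```

All access goes through synchronized methods, so callers with equal parameters share one instance and no caller mutates a shared static field.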

PrivateCachedRefreshTokenBasedAccessTokenProvider
- The override doesn't seem to serve a purpose. Since it's a workaround, adding 
audience/visibility annotations (HADOOP-5073) would emphasize that this is 
temporary.

PrivateAzureDataLakeFileSystem
- catching {{ArrayIndexOutOfBoundsException}} instead of performing proper 
bounds checking in {{BufferManager::get}} is not efficient:
{code:title=PrivateAzureDataLakeFileSystem.java}
synchronized (BufferManager.getLock()) {
  if (bm.hasData(fsPath.toString(), fileOffset, len)) {
    try {
      bm.get(data, fileOffset);
      validDataHoldingSize = data.length;
      currentFileOffset = fileOffset;
    } catch (ArrayIndexOutOfBoundsException e) {
      fetchDataOverNetwork = true;
    }
  } else {
    fetchDataOverNetwork = true;
  }
}
{code}
{code:title=BufferManager.java}
void get(byte[] data, long offset) {
  System.arraycopy(buffer.data, (int) (offset - buffer.offset), data, 0,
      data.length);
}
{code}
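
One way to avoid relying on the exception is to check the requested range before copying and report a miss through the return value. The sketch below is a simplified, hypothetical stand-in for {{BufferManager}} (a single buffer, with a {{tryGet}} method that does not exist in the patch), meant only to show the check:

```java
/** Hypothetical stand-in for the patch's BufferManager, reduced to one buffer. */
final class BoundsCheckedBuffer {
  private final byte[] data;  // cached bytes
  private final long offset;  // file offset of data[0]

  BoundsCheckedBuffer(byte[] data, long offset) {
    this.data = data;
    this.offset = offset;
  }

  /**
   * Copies dst.length bytes starting at fileOffset into dst.
   * Returns false, leaving dst untouched, if the range is not fully cached.
   */
  boolean tryGet(byte[] dst, long fileOffset) {
    long start = fileOffset - offset;
    // Compare in long arithmetic so the int cast below cannot overflow.
    if (start < 0 || start > data.length - (long) dst.length) {
      return false;
    }
    System.arraycopy(data, (int) start, dst, 0, dst.length);
    return true;
  }
}
```

With a method of this shape, the caller would set {{fetchDataOverNetwork}} when the call returns false and would not need the try/catch.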

The BufferManager/PrivateAzureDataLakeFileSystem synchronization is unorthodox, 
and verifying its correctness is not straightforward. Layering that complexity 
on top of the readahead logic without simplifying abstractions makes it very 
difficult to review. I hope subsequent revisions will replace this code with a 
clearer model, because the current code will be very difficult to maintain.

> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: Create_Read_Hadoop_Adl_Store_Semantics.pdf, 
> HADOOP-12666-002.patch, HADOOP-12666-003.patch, HADOOP-12666-004.patch, 
> HADOOP-12666-005.patch, HADOOP-12666-006.patch, HADOOP-12666-007.patch, 
> HADOOP-12666-008.patch, HADOOP-12666-009.patch, HADOOP-12666-010.patch, 
> HADOOP-12666-011.patch, HADOOP-12666-012.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft 
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing 
> Hadoop applications such as MR, Hive, HBase, etc. to use ADL store as 
> input or output.
>  
> ADL is ultra-high capacity, optimized for massive throughput with rich 
> management and security features. More details available at 
> https://azure.microsoft.com/en-us/services/data-lake-store/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
