[ 
https://issues.apache.org/jira/browse/HADOOP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511213
 ] 

Owen O'Malley commented on HADOOP-1568:
---------------------------------------

I really don't see how scraping data out of HTML is better than parsing XML. In 
my experience, HTML scraping is fairly brittle because the dependencies 
aren't obvious to the maintainers of the server. Putting the data into an XML 
format makes it very clear that formatting doesn't matter, but that attribute 
names do. You also get libraries to help you parse the XML, which you don't for 
HTTP.
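
As a rough illustration of the point about attribute names, here is a minimal 
client-side sketch in Python using the standard xml.etree.ElementTree module. 
The sample document follows the proposed response format from the issue 
description; the concrete values (paths, sizes, timestamps, version="0") are 
made up for the example, and parse_listing is a hypothetical helper, not part 
of any Hadoop API.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response in the proposed <listing> format.
# All attribute values here are invented for illustration only.
SAMPLE = """<listing path="/data" recursive="no" filter=""
         time="2007-07-09 12:00:00 UTC" version="0">
  <directory path="/data/logs"/>
  <file path="/data/part-00000" modified="2007-07-09 11:00:00" blocksize="67108864"
        replication="3" size="1024"
        dnurl="http://dn:port/streamFile?..."/>
</listing>"""


def parse_listing(xml_text):
    """Return (directory paths, list of file attribute dicts)."""
    root = ET.fromstring(xml_text)
    # Lookups are by element and attribute name only, so whitespace,
    # line breaks, and attribute order in the response are irrelevant.
    dirs = [d.get("path") for d in root.findall("directory")]
    files = [dict(f.attrib) for f in root.findall("file")]
    return dirs, files


dirs, files = parse_listing(SAMPLE)

# Collapsing all runs of whitespace changes the layout but not the result.
compact = " ".join(SAMPLE.split())
assert parse_listing(compact) == (dirs, files)
```

The client depends only on the names "directory", "file", "path", "size" and so 
on, which is exactly the contract the server maintainers can see in the schema.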

Our primary use case is distcp of a large dataset. It will kill performance to 
require the copy planner to do an HTTP HEAD for each file (or, even worse, for 
each attribute). I really don't see any way to make it perform adequately using 
that approach. Yes, the client will need to cache the metadata, but that isn't 
too hard.
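
To make the caching idea concrete, here is a small Python sketch, under stated 
assumptions, of a copy planner that issues one recursive listing query and 
then answers every per-file question from a local dictionary. The function 
names, the host and port, and the "yes"/"no" encoding of the recursive 
parameter are all assumptions for the example (the proposal only says the 
parameter is a boolean), and the sample listing values are invented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_listing_url(namenode, port, path="/", recursive=False):
    """Build a single ls.jsp query URL (host, port, and the yes/no
    encoding of 'recursive' are assumptions, not part of the proposal)."""
    query = urlencode({"path": path, "recursive": "yes" if recursive else "no"})
    return "http://%s:%d/ls.jsp?%s" % (namenode, port, query)


def build_cache(xml_text):
    """Index every <file> element by path so later lookups need no requests."""
    root = ET.fromstring(xml_text)
    return {
        f.get("path"): {
            "size": int(f.get("size")),
            "modified": f.get("modified"),
            "dnurl": f.get("dnurl"),
        }
        for f in root.iter("file")
    }


# Hypothetical response to one recursive listing; values are invented.
RESPONSE = """<listing path="/data" recursive="yes" filter=""
         time="2007-07-09 12:00:00 UTC" version="0">
  <directory path="/data/logs"/>
  <file path="/data/part-00000" modified="2007-07-09 11:00:00"
        blocksize="67108864" replication="3" size="1024"
        dnurl="http://dn:port/streamFile?..."/>
  <file path="/data/logs/a.log" modified="2007-07-09 11:30:00"
        blocksize="67108864" replication="3" size="2048"
        dnurl="http://dn:port/streamFile?..."/>
</listing>"""

url = build_listing_url("namenode.example.com", 50070, "/data", recursive=True)
cache = build_cache(RESPONSE)

# The planner can now size the whole copy without any further round trips.
total_bytes = sum(meta["size"] for meta in cache.values())
```

With N files, this is one request and N dictionary lookups, instead of at 
least N HEAD requests before the copy can even be planned.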

> NameNode Schema for HttpFileSystem
> ----------------------------------
>
>                 Key: HADOOP-1568
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1568
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Chris Douglas
>            Assignee: Chris Douglas
>
> This issue will track the design and implementation of (the first pass of) a 
> servlet on the namenode for querying its filesystem via HTTP. The proposed 
> syntax for queries and responses is as follows.
> *Query*
> {noformat}GET http://<nn>:<port>/ls.jsp[<?option>[&option]*] HTTP/1.1{noformat}
> Where _option_ may be any of the following query parameters:
> _path_ : String (default: '/')
> _recursive_ : boolean (default: false)
> _filter_ : String (default: none)
> *Response*
> The response will be returned as an XML document in the following format:
> {noformat}
> <listing path="..." recursive="(yes|no)" filter="..."
>          time="yyyy-MM-dd hh:mm:ss UTC" version="...">
>   <directory path="..."/>
>   <file path="..." modified="yyyy-MM-dd hh:mm:ss" blocksize="..."
>         replication="..." size="..."
>         dnurl="http://dn:port/streamFile?..."/>
> </listing>
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.