[ 
https://issues.apache.org/jira/browse/PIG-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Eldawy updated PIG-3865:
------------------------------

    Description: 
I recreated the XMLLoader in PiggyBank to work line by line instead of 
character by character. This makes it more efficient as it uses precompiled 
regular expressions on each line instead of doing checks on a character by 
character basis. The code is also significantly smaller which makes it more 
maintainable.

To give you some context: I'm a PhD student at the University of Minnesota. I 
built SpatialHadoop [http://spatialhadoop.cs.umn.edu], which is an extension to 
Hadoop that adds spatial data types and indexes in HDFS. The system is open 
source and has been downloaded more than 75,000 times so far. Part of it is to 
provide a simple high-level language that works with spatial data.

I proposed Pigeon [http://spatialhadoop.cs.umn.edu/pigeon] as a spatial 
extension to Pig. My case study is the planet file from OpenStreetMap. This is 
a 450GB XML file that contains all the information about the whole planet. I 
previously used XMLLoader to parse it. I found some bugs and fixed them in 
previous issues. Now, I found that it takes a lot of time to parse the XML 
file. To be a good citizen, I remodeled the XMLLoader to work line by line and 
use precompiled regular expressions, which makes it faster. The parsing time of 
the compressed OSM planet file dropped from 5:30 hours to 3:30 hours on my 
cluster setup with Hadoop 1.2.1. By the way, Pigeon was presented at ICDE 2014 
[http://ieee-icde2014.eecs.northwestern.edu/program.html], a top conference in 
data engineering.

The code is now more maintainable. For example, I can easily modify it to 
accept a regular expression for the XML identifier so that it matches all 
tags that satisfy the regular expression instead of just returning a fixed 
static tag. In this version, I didn't add any new features, but they can be 
added in the future.
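
For illustration only, a minimal Java sketch of the line-by-line idea described 
above. It is not the code in the attached patch: the class name LineTagScanner, 
its scan method, and the tag-name regex parameter are hypothetical. It assumes 
elements with the same name are not nested and that attribute values do not 
contain '>'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the PiggyBank XMLLoader implementation.
public class LineTagScanner {
    // Both patterns are compiled once and reused for every input line.
    private final Pattern open;
    private final Pattern close;

    // tagNameRegex is an assumed parameter, e.g. "node" or "node|way".
    public LineTagScanner(String tagNameRegex) {
        // Opening tag with optional attributes; also matches self-closing tags.
        this.open = Pattern.compile("<(?:" + tagNameRegex + ")(?:\\s[^>]*)?/?>");
        this.close = Pattern.compile("</(?:" + tagNameRegex + ")\\s*>");
    }

    // Returns each matching element as one string, scanning line by line.
    public List<String> scan(Iterable<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // non-null while inside an open element
        for (String line : lines) {
            int pos = 0;
            while (true) {
                if (current == null) {
                    Matcher m = open.matcher(line);
                    if (!m.find(pos)) {
                        break; // no further opening tag on this line
                    }
                    if (m.group().endsWith("/>")) {
                        records.add(m.group()); // self-closing element
                    } else {
                        current = new StringBuilder(m.group());
                    }
                    pos = m.end();
                } else {
                    Matcher m = close.matcher(line);
                    if (m.find(pos)) {
                        // Element ends on this line; keep text up to the closing tag.
                        current.append(line, pos, m.end());
                        records.add(current.toString());
                        current = null;
                        pos = m.end();
                    } else {
                        // Element continues on the next line.
                        current.append(line, pos, line.length()).append('\n');
                        break;
                    }
                }
            }
        }
        return records;
    }
}
```

Passing a pattern such as "node|way" instead of a single fixed name is the kind 
of extension mentioned above; with a fixed tag the same code reduces to matching 
one literal name per line.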

  was:
I recreated the XMLLoader in PiggyBank to work line by line instead of 
character by character. This makes it more efficient as it uses precompiled 
regular expressions on each line instead of doing checks on a character by 
character basis. The code is also significantly smaller which makes it more 
maintainable.

To give you some context: I'm a PhD student at the University of Minnesota. I 
built SpatialHadoop [http://spatialhadoop.cs.umn.edu], which is an extension to 
Hadoop that adds spatial data types and indexes in HDFS. The system is open 
source and has been downloaded more than 75,000 times so far. Part of it is to 
provide a simple high-level language that works with spatial data.

I proposed Pigeon [http://spatialhadoop.cs.umn.edu/pigeon] as a spatial 
extension to Pig. My case study is the planet file from OpenStreetMap. This is 
a 450GB XML file that contains all the information about the whole planet. I 
previously used XMLLoader to parse it. I found some bugs and fixed them in 
previous issues. Now, I found that it takes a lot of time to parse the XML 
file. To be a good citizen, I remodeled the XMLLoader to work line by line and 
use precompiled regular expressions, which makes it faster. By the way, Pigeon 
was presented at ICDE 2014 
[http://ieee-icde2014.eecs.northwestern.edu/program.html], a top conference in 
data engineering.

The code is now more maintainable. For example, I can easily modify it to 
accept a regular expression for the XML identifier so that it matches all 
tags that satisfy the regular expression instead of just returning a fixed 
static tag. In this version, I didn't add any new features, but they can be 
added in the future.


> Remodel the XMLLoader to work to be faster and more maintainable
> ----------------------------------------------------------------
>
>                 Key: PIG-3865
>                 URL: https://issues.apache.org/jira/browse/PIG-3865
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Ahmed Eldawy
>            Assignee: Ahmed Eldawy
>            Priority: Minor
>
> I recreated the XMLLoader in PiggyBank to work line by line instead of 
> character by character. This makes it more efficient as it uses precompiled 
> regular expressions on each line instead of doing checks on a character by 
> character basis. The code is also significantly smaller which makes it more 
> maintainable.
> To give you some context: I'm a PhD student at the University of Minnesota. 
> I built SpatialHadoop [http://spatialhadoop.cs.umn.edu], which is an extension 
> to Hadoop that adds spatial data types and indexes in HDFS. The system is 
> open source and has been downloaded more than 75,000 times so far. Part of it 
> is to provide a simple high-level language that works with spatial data.
> I proposed Pigeon [http://spatialhadoop.cs.umn.edu/pigeon] as a spatial 
> extension to Pig. My case study is the planet file from OpenStreetMap. This 
> is a 450GB XML file that contains all the information about the whole planet. 
> I previously used XMLLoader to parse it. I found some bugs and fixed them in 
> previous issues. Now, I found that it takes a lot of time to parse the XML 
> file. To be a good citizen, I remodeled the XMLLoader to work line by line 
> and use precompiled regular expressions, which makes it faster. The parsing 
> time of the compressed OSM planet file dropped from 5:30 hours to 3:30 hours on 
> my cluster setup with Hadoop 1.2.1. By the way, Pigeon was presented at ICDE 
> 2014 [http://ieee-icde2014.eecs.northwestern.edu/program.html], a top 
> conference in data engineering.
> The code is now more maintainable. For example, I can easily modify it to 
> accept a regular expression for the XML identifier so that it matches all 
> tags that satisfy the regular expression instead of just returning a fixed 
> static tag. In this version, I didn't add any new features, but they can be 
> added in the future.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
