Hi,
My name is Xiao, and I am currently a PhD student at Simon Fraser University,
Canada.
A little background on myself:
- My research is mainly on data management, especially NoSQL databases.
- I took part in GSoC 2008, working on PostgreSQL [1], when I was an
undergraduate student :-)
- I have now been working on open source projects for one year. They include
Apache Hive [2] and Apache Drill [3], both of which are SQL-on-Hadoop engines.
I've also played with Apache Spark for a while and have some hands-on
experience. I am learning Scala and quite like it.
- While working on the Hadoop ecosystem, I gained experience deploying
clusters for development and testing. Docker is a great tool for this purpose,
and I have built several complex Docker containers [4].
I heard about the great DBpedia project a long time ago and have always wanted
to play with it :-)
Given my background, I am very interested in the following project:
Parallel processing in DBpedia extraction Framework [5].
Here are some thoughts from my initial impression; I would appreciate your
feedback:
- The project uses Spark 0.9.1, while the latest version of Spark is 1.2.1.
I suppose there will be some work to upgrade it to the new version.
- It looks like the process puts the data into HDFS, uses Spark to extract the
data, and writes the result back to HDFS. Is there a design document for this
project?
- Spark can work with various distributed file systems (S3, GlusterFS, etc.),
not only HDFS, so I suppose this could be made configurable.
I will try it out in the following days.
Any suggestions for evolving this project?
I look forward to contributing to DBpedia!
[1] https://wiki.postgresql.org/wiki/GSoC_2008
[2] https://github.com/apache/hive
[3] https://github.com/apache/drill
[4] https://github.com/xiaom/docker-drill
[5] https://github.com/dbpedia/distributed-extraction-framework