Hi Xiao, and welcome!
> Some thoughts from my initial impression; I would appreciate your feedback:
> - The project uses Spark 0.9.1, while the latest version of Spark is
> 1.2.1. I suppose there will be some work to upgrade it to the
> new version.
>
It would probably be good to port the code to Spark 1.2.1. I can't imagine
it will take much work, because the Spark API has been fairly stable since
then.
> - It looks like the process puts the data into HDFS, uses Spark to extract
> the data, and writes the result back to HDFS. Is there a design document
> for this project?
>
Yes, but it can also work without HDFS. On a single-node cluster you can
write directly to the file system (I'm not sure if there is enough
documentation on that, but there should be; it's mostly about substituting
hdfs:///home/user/blah with file:///home/user/blah). On a multi-node
cluster with NFS you can also work without HDFS.
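
To make the substitution concrete, here is a minimal sketch in Python (used
only for illustration; it is not code from the framework). The helper name
with_scheme is hypothetical, and where the framework actually reads its
input and output paths from is not shown here. It only demonstrates swapping
the URI scheme while leaving the path untouched:

```python
def with_scheme(path: str, scheme: str) -> str:
    """Return `path` with its URI scheme replaced by `scheme`.

    Hypothetical helper for illustration only; not part of the
    extraction framework.
    """
    # Split "hdfs:///home/user/blah" into "hdfs", "://", "/home/user/blah".
    _, sep, rest = path.partition("://")
    if not sep:
        # No scheme present: treat the whole string as the path part.
        rest = path
    return f"{scheme}://{rest}"


# Switching the example path from HDFS to the local file system:
local_path = with_scheme("hdfs:///home/user/blah", "file")
print(local_path)  # file:///home/user/blah
```

The same idea would apply to any other scheme Spark can read from, as long
as the matching file system support is available on the cluster.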
I have been meaning to write a proper paper on the project for a few
months now, but I have never managed to get around to it.
> - Spark can work with various distributed file systems (S3, GlusterFS,
> etc.), not only HDFS. So I suppose this could be configurable.
It'd be a good idea to make this configurable, and I suppose it fits in
well with the Docker containers idea too: different configurations for
EC2/S3, Google Cloud, etc.
Feel free to ask any other questions that you may have while running it.
Cheers,
Nilesh
You can also email me at [email protected] or visit my website
<http://nileshc.com/>
On Thu, Mar 5, 2015 at 8:27 PM, Xiao Meng <[email protected]> wrote:
> Hi,
>
> My name is Xiao, and I am currently a PhD student at Simon Fraser
> University, Canada.
>
> A little background on myself:
>
> - My research is mainly on data management, especially NoSQL databases.
> - I worked for GSoC 2008 on PostgreSQL [1] when I was an undergraduate
> student :-)
> - I have been working on some open source projects for one year now. They
> include Apache Hive [2] and Apache Drill [3], both SQL-on-Hadoop engines.
> I've also played with Apache Spark for a while and have some hands-on
> experience. I am learning Scala and like it a lot.
> - While working in the Hadoop ecosystem, I gained experience deploying
> clusters for dev and test. Docker is a great tool for this purpose, and I
> have been building several complex Docker containers [4].
>
> I heard about the great DBpedia project a long time ago and have always
> wanted to play with it :-)
>
> Given my background, I am quite interested in the following project:
>
> Parallel processing in DBpedia extraction Framework [5].
>
> Some thoughts from my initial impression; I would appreciate your feedback:
>
> - The project uses Spark 0.9.1, while the latest version of Spark is 1.2.1.
> I suppose there will be some work to upgrade it to the new version.
> - It looks like the process puts the data into HDFS, uses Spark to extract
> the data, and writes the result back to HDFS. Is there a design document
> for this project?
> - Spark can work with various distributed file systems (S3, GlusterFS,
> etc.), not only HDFS. So I suppose this could be configurable.
>
> I will try it out in the following days.
> Any suggestions for evolving this project?
>
> I look forward to contributing to DBpedia!
>
> [1] https://wiki.postgresql.org/wiki/GSoC_2008
> [2] https://github.com/apache/hive
> [3] https://github.com/apache/drill
> [4] https://github.com/xiaom/docker-drill
> [5] https://github.com/dbpedia/distributed-extraction-framework
>
>
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming The Go Parallel Website,
> sponsored by Intel and developed in partnership with Slashdot Media, is
> your hub for all things parallel software development, from weekly thought
> leadership blogs to news, videos, case studies, tutorials and more. Take a
> look and join the conversation now. http://goparallel.sourceforge.net/
> _______________________________________________
> Dbpedia-gsoc mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>
>