On Mon, Mar 16, 2015 at 8:59 AM, Navin Pai <lifeofna...@gmail.com> wrote:

> Hi Dimitris,
> Thanks for clarifying the thought process behind the project. Correct me
> if I'm wrong, but what we're aiming for is to allow people to play with
> DBpedia projects using a single straightforward 'docker run' or
> 'spark-submit' command, right?
>

I wish it could be that simple :) but the idea is to make it as
straightforward as possible.


> Spark announced its 1.3.0 release a couple of days ago [1]. I have a
> single node cluster running Hadoop 2.6 and Spark 1.2.1. I'll upgrade my
> version of Spark and try to port the code to the newer versions. I'm hoping
> there won't be too many roadblocks. I'll keep the mailing list updated on
> how it goes :)
>

The idea behind this task is to get you familiar with the code and help you
write a better application.
The upgrade doesn't have to succeed for the warm-up task to be useful; you
could use v1.2.1 as well, or any other version.


Cheers,
Dimitris

>
> [1] https://spark.apache.org/news/spark-1-3-0-released.html
>
> On Tue, Mar 10, 2015 at 3:58 PM, Dimitris Kontokostas <jimk...@gmail.com>
> wrote:
>
>> Hi Navin,
>>
>> On Sun, Mar 8, 2015 at 1:46 PM, Navin Pai <lifeofna...@gmail.com> wrote:
>>
>>> Yup, looking at the changelog of Apache Spark and having worked on
>>> upgrading much smaller applications across Spark versions, I can attest
>>> that this process shouldn't take too much time. The number of breaking
>>> changes is minimal in recent versions.
>>>
>>
>> Maybe this could be a warm-up task for this project.
>>
>>
>>> An idea I had, which I would like feedback on, is a configuration picker
>>> rather than a list of preconfigured containers/images, kind of along the
>>> lines of Fedora's Revisor project [1]. You could mix and match depending
>>> on the configuration you want, and a customized image/container would be
>>> created for you. Of course, the feasibility of this is an open question...
>>>
>>
>> This sounds like a good idea, but I would give it lower priority and try
>> it if there is time left at the end of the project.
>>
>>
>>> Honestly, if you ask me, this one project could probably be broken up
>>> into multiple projects, each with a different end goal. Docker brings in a
>>> very interesting set of things to play with, and it would be great if some
>>> of the mentors could provide more feedback on what the end goal of this
>>> specific GSoC project is. :)
>>>
>>
>> We are trying to bring DBpedia closer to industry-related / big data
>> projects, and the preconfigured images or easily configurable scripts are a
>> step towards industry adoption. So the idea is to give people tools to
>> easily experiment with the code & data and see if they can invest more time
>> to port it into their software stack.
>> Another goal is to make it easy to run on a single-node cluster. Some
>> preliminary results from Nilesh showed a big boost in extraction time even
>> on a single machine, due to better utilization of the hard disk, so this
>> could speed up our static releases.
>>
>> Best,
>> Dimitris
>>
>>
>>>
>>> Thanks
>>>
>>> [1] http://revisor.fedoraunity.org/
>>>
>>>
>>>
>>> Hi Xiao, and welcome!
>>>>
>>>> Some thoughts from my initial impression and I appreciate your feedback:
>>>> > - The project uses Spark 0.9.1 while the latest version of Spark is
>>>> > bumped to 1.2.1. I suppose there will be some work to upgrade it to the
>>>> > new version.
>>>> >
>>>>
>>>> It'll perhaps be good to port the code to Spark 1.2.1; I can't imagine
>>>> it'll take too much work because the Spark API has been pretty stable
>>>> since then.
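>>>>
>>>> For reference, here is a minimal sketch of what a Spark 1.2.x entry point
>>>> looks like in Scala (the app name and input path are placeholders, not
>>>> the actual framework code):
>>>>
>>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>>
>>>>   // SparkConf-based setup, as used since Spark 0.9/1.0
>>>>   val conf = new SparkConf()
>>>>     .setAppName("dbpedia-distributed-extraction")  // placeholder name
>>>>     .setMaster("local[4]")                         // or a cluster master URL
>>>>   val sc = new SparkContext(conf)
>>>>
>>>>   // placeholder input; the real job reads the Wikipedia dumps
>>>>   val dump = sc.textFile("hdfs:///user/dbpedia/enwiki-dump.xml")
>>>>   println(dump.count())
>>>>   sc.stop()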
>>>>
>>>>
>>>> > - It looks like the process is putting the data into HDFS, using Spark
>>>> > to extract the data, and writing the result back to HDFS. Is there any
>>>> > design document for this project?
>>>> >
>>>>
>>>> Yes, but it can also work without HDFS. On a single-node cluster you can
>>>> write directly to the file system (I'm not sure if there is enough
>>>> documentation on that, but there should be; it's mostly about substituting
>>>> hdfs:///home/user/blah with file:///home/user/blah). On a multi-node
>>>> cluster with NFS you can also work without HDFS.
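>>>>
>>>> To make that concrete, a small sketch (the paths are made up; only the
>>>> URI scheme differs between the two setups):
>>>>
>>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>>   val sc = new SparkContext(new SparkConf().setAppName("io-demo"))
>>>>
>>>>   // single node: read and write straight on the local file system
>>>>   val local = sc.textFile("file:///home/user/dumps/enwiki.xml")
>>>>   local.saveAsTextFile("file:///home/user/out/extracted")
>>>>
>>>>   // same job backed by HDFS: only the scheme changes
>>>>   val onHdfs = sc.textFile("hdfs:///home/user/dumps/enwiki.xml")
>>>>   onHdfs.saveAsTextFile("hdfs:///home/user/out/extracted")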
>>>>
>>>> I have been meaning to write a proper paper on the project for a few
>>>> months now, but never managed to get around to it.
>>>>
>>>> > - Spark can work with various distributed file systems (S3, GlusterFS,
>>>> > etc.), not limited to HDFS. So I suppose this could be configurable.
>>>>
>>>>
>>>> It'd be a good idea to make this configurable, and I suppose it fits in
>>>> well with the Docker containers idea too. Different kinds of
>>>> configurations for EC2/S3, Google Cloud, etc.
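>>>>
>>>> As a rough sketch of what "configurable" could mean here (the property
>>>> name and example URIs are invented for illustration, not something the
>>>> framework reads today):
>>>>
>>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>>   val sc = new SparkContext(new SparkConf().setAppName("configurable-io"))
>>>>
>>>>   // Take the storage location from a system property instead of
>>>>   // hard-coding it, so the same job can target HDFS, S3, a GlusterFS
>>>>   // mount or the local disk.
>>>>   val base = sys.props.getOrElse("extraction.baseUri",
>>>>                                  "file:///home/user/dbpedia")
>>>>   // e.g. -Dextraction.baseUri=hdfs:///user/dbpedia
>>>>   //      -Dextraction.baseUri=s3n://my-bucket/dbpedia
>>>>   val dump = sc.textFile(s"$base/dumps/enwiki.xml")
>>>>   dump.saveAsTextFile(s"$base/output/extracted")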
>>>>
>>>> Feel free to ask any other questions that you may have while running it.
>>>>
>>>> Cheers,
>>>> Nilesh
>>>>
>>>> You can also email me at cont...@nileshc.com or visit my website
>>>> <http://nileshc.com/>
>>>>
>>>>
>>>> On Thu, Mar 5, 2015 at 8:27 PM, Xiao Meng <xiaom...@gmail.com> wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > My name is Xiao, currently a PhD student at Simon Fraser University,
>>>> > Canada.
>>>> >
>>>> >
>>>> > A little background on myself:
>>>> >
>>>> > - My research is mainly on data management, especially NoSQL databases.
>>>> > - I worked on PostgreSQL [1] for GSoC 2008 when I was an undergraduate
>>>> > student :-)
>>>> > - I have now been working on some open source projects for one year.
>>>> > They include Apache Hive [2] and Apache Drill [3], both SQL-on-Hadoop
>>>> > engines. I've also played with Apache Spark for a while and have some
>>>> > hands-on experience. I am learning Scala and quite like it.
>>>> > - While working on the Hadoop ecosystem, I gained experience in
>>>> > deploying clusters for dev and test. Docker is a great tool for this
>>>> > purpose and I have been building several complex Docker containers [4].
>>>> >
>>>> > I heard of the great DBpedia project a long time ago and have always
>>>> > wanted to play with it :-)
>>>> >
>>>> > Given my background, I am pretty interested in the following project:
>>>> > Parallel processing in DBpedia extraction Framework [5].
>>>> >
>>>> >
>>>> > Some thoughts from my initial impression, and I appreciate your
>>>> > feedback:
>>>> >
>>>> > - The project uses Spark 0.9.1 while the latest version of Spark is
>>>> > bumped to 1.2.1. I suppose there will be some work to upgrade it to the
>>>> > new version.
>>>> > - It looks like the process is putting the data into HDFS, using Spark
>>>> > to extract the data, and writing the result back to HDFS. Is there any
>>>> > design document for this project?
>>>> >     - Spark can work with various distributed file systems (S3,
>>>> > GlusterFS, etc.), not limited to HDFS. So I suppose this could be
>>>> > configurable.
>>>> >
>>>> > I will try it out in the following days. Any suggestions for evolving
>>>> > this project?
>>>> >
>>>> > Looking forward to contributing to DBpedia!
>>>> >
>>>> >
>>>> > [1] https://wiki.postgresql.org/wiki/GSoC_2008
>>>> > [2] https://github.com/apache/hive
>>>> > [3] https://github.com/apache/drill
>>>> > [4] https://github.com/xiaom/docker-drill
>>>> > [5] https://github.com/dbpedia/distributed-extraction-framework
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Kontokostas Dimitris
>>
>
>


-- 
Kontokostas Dimitris
_______________________________________________
Dbpedia-gsoc mailing list
Dbpedia-gsoc@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
