Hi Pratyush,
Please ask all questions on the mailing list during this phase, rather than
in private, if you want to get a reply.
Please clarify in what ways we will use the Wikipedia ClickStream info in
DBpedia.
The clickstream info will mainly be used to produce a graph structure of
user behaviour on Wikipedia.
We will convert each prev_id and curr_id to a DBpedia resource and add a
directed edge from prev_id to curr_id. We will load the entire dataset
into Apache Spark GraphX.
We will then compute the in-degree, out-degree, HITS and PageRank scores of
each node.
None of these fields will be used as ontology classes. You will instead
create new properties called:
csInDegree, csOutDegree, csPageRank, csHits, etc.
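As a rough sketch of that pipeline (hedged: the input file name and the
tab-separated field layout are assumptions based on the clickstream format
you listed, and HITS is not shipped with GraphX):

  import org.apache.spark.SparkContext
  import org.apache.spark.graphx.{Edge, Graph}

  val sc = new SparkContext()  // or the spark-shell's sc
  // Each clickstream row becomes a directed edge prev_id -> curr_id,
  // weighted by n (the click count); rows whose ids are not numeric
  // (e.g. external referrers, or the header line) are skipped.
  val edges = sc.textFile("2016_02_en_clickstream.tsv").flatMap { line =>
    val f = line.split("\t")
    def numeric(s: String) = s.nonEmpty && s.forall(_.isDigit)
    if (f.length >= 3 && numeric(f(0)) && numeric(f(1)))
      Some(Edge(f(0).toLong, f(1).toLong, f(2).toLong))
    else None
  }
  val graph = Graph.fromEdges(edges, defaultValue = 0L)

  val inDeg  = graph.inDegrees                  // -> csInDegree
  val outDeg = graph.outDegrees                 // -> csOutDegree
  val ranks  = graph.pageRank(0.0001).vertices  // -> csPageRank
  // HITS (csHits) has no built-in GraphX implementation; it would need
  // a custom Pregel or aggregateMessages loop.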
And what types of graph measures should we take?
Let me pass the ball back to you: what types of graph measures do you think
would be useful? Have you looked into which ones are pre-implemented in
Spark?
I also commented on your proposal.
Cheers,
Alexandru
On Mon, Mar 21, 2016 at 6:14 PM, Pratyush Kumar <[email protected]>
wrote:
> Hi Alexandru,
> Please clarify in what ways we will use the Wikipedia ClickStream info in
> DBpedia.
> First, we extract the Wikipedia clickstream dump file. The data includes
> six fields: prev_id, curr_id, n, prev_title, curr_title, type.
> Out of these fields, how many will we use as properties for ontology
> classes?
> I need to clarify these issues so that I can design a proper architecture
> for the clickstream info.
> And what types of graph measures should we take?
> Please reply.
> Thanks
>
>
>
>
> On Sun, Mar 20, 2016 at 12:29 AM, Pratyush Kumar <[email protected]>
> wrote:
>
>> Hi Alexandru,
>> I have some queries regarding the ClickStream info, i.e., in what ways we
>> incorporate the Wikipedia clickstream info into DBpedia.
>> Should we add the path of "what is clicked next" to every article?
>> And word and entity frequencies can easily be computed with Spark.
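>> A minimal sketch (the input file name is hypothetical; note that a bare
>> count() only returns the number of elements, so per-word frequencies use
>> map + reduceByKey):
>>
>>   // sc: SparkContext, as in the warm-up tasks
>>   val freqs = sc.textFile("articles.txt")
>>     .flatMap(_.split("\\s+"))
>>     .map(word => (word, 1L))
>>     .reduceByKey(_ + _)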
>> For edit frequencies, should we add a new property on all ontology classes
>> recording how many edits are done on a particular article?
>> What types of graph measures should we take?
>> I have started writing my proposal, and I need your help with clarifying
>> these issues.
>> Please reply ASAP.
>>
>> Thanks
>>
>>
>>
>>
>> On Fri, Mar 18, 2016 at 7:13 PM, Pratyush Kumar <[email protected]>
>> wrote:
>>
>>> Hi Alexandru,
>>> Thanks for the clarification regarding all my queries.
>>> I am familiar with Java/Scala and am learning Spark. Please give me some
>>> more warm-up tasks so that I become more familiar with the scope of the
>>> project and understand it more deeply.
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>> On Fri, Mar 18, 2016 at 6:57 PM, Alexandru Todor <[email protected]
>>> > wrote:
>>>
>>>> Hi Pratyush,
>>>>
>>>> Welcome to the DBpedia GSoC mailing list,
>>>>
>>>> I will try to answer your questions:
>>>>
>>>>> Do we add a page visit count to the DBpedia Ontology and synchronize it
>>>>> with DBpedia Live?
>>>>
>>>>
>>>> Yes, we should add a new property to each DBpedia resource, called
>>>> wikipediaVisitCount or similar. This property will be hard-coded for the
>>>> time being, since it needs to be present for all ontology classes and not
>>>> for a specific one. The exact way we should represent the new property
>>>> should be discussed on the dbpedia-ontology mailing list.
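>>>> For illustration only (the property IRI and value here are placeholders
>>>> pending that discussion), the output could look like this N-Triples line
>>>> (wrapped for email):
>>>>
>>>>   <http://dbpedia.org/resource/Berlin>
>>>>       <http://dbpedia.org/ontology/wikipediaVisitCount>
>>>>       "12345"^^<http://www.w3.org/2001/XMLSchema#integer> .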
>>>>
>>>> We cannot synchronize this with DBpedia Live for now, since there is no
>>>> update stream for the page counts as far as I know. We will automate the
>>>> extraction so that it periodically polls the Wikipedia servers and
>>>> executes an extraction automatically when new information is available.
>>>> We could then feed the output to DBpedia Live or any other endpoint, or
>>>> just make the extracted dumps available.
>>>>
>>>>> We have to extract page visit and clickstream info from Wikipedia. Page
>>>>> visit counts are available in the Wikimedia dumps at
>>>>> https://dumps.wikimedia.org/other/pagecounts-all-sites/ . Can we extract
>>>>> these dumps and add the data where we want? I am not able to open these
>>>>> dumps because they take a lot of time to download.
>>>>
>>>>
>>>> Yes, we download the dumps from Wikipedia. You don't need to download the
>>>> entire dumps; a couple of files are enough. You need to make your code
>>>> able to work with the full dump contents. I will execute the code on the
>>>> full dumps for you.
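>>>> For example, a rough Spark sketch over one pagecounts file (hedged: the
>>>> file name is a sample, and the space-separated layout "project page_title
>>>> count_views bytes_transferred" is assumed from the dump documentation):
>>>>
>>>>   val views = sc.textFile("pagecounts-20160301-000000")
>>>>     .map(_.split(" "))
>>>>     .filter(f => f.length == 4 && f(0) == "en")  // English Wikipedia only
>>>>     .map(f => (f(1), f(2).toLong))   // (page title, hourly view count)
>>>>     .reduceByKey(_ + _)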
>>>>
>>>> If you are familiar with Java and/or Scala and the corresponding Apache
>>>> Spark API, I can proceed to give you more serious warm-up tasks that will
>>>> help you better understand the scope of the project.
>>>>
>>>>> Please help me with these issues, and let me know where we can get help
>>>>> from mentors in writing the proposal.
>>>>>
>>>>
>>>> I am the mentor for this project and will help you write the proposal.
>>>> You can ask questions here or on the DBpedia ideas website.
>>>>
>>>> Cheers,
>>>> Alexandru
>>>>
>>>>
>>>> On Wed, Mar 16, 2016 at 7:49 AM, Pratyush Kumar <[email protected]
>>>> > wrote:
>>>>
>>>>> Hi all,
>>>>> I am Pratyush Kumar, a B.Tech student from IIT Roorkee, India.
>>>>> I am interested in the DBpedia project: Derived/Extra WikiPage
>>>>> Information Extractor. For this, I have done all the warm-up and
>>>>> recommended tasks. I have a good knowledge of Java and Scala, and I am
>>>>> learning Apache Spark.
>>>>> My queries are:
>>>>> Do we add a page visit count to the DBpedia Ontology and synchronize it
>>>>> with DBpedia Live?
>>>>> We have to extract page visit and clickstream info from Wikipedia. Page
>>>>> visit counts are available in the Wikimedia dumps at
>>>>> https://dumps.wikimedia.org/other/pagecounts-all-sites/ . Can we extract
>>>>> these dumps and add the data where we want? I am not able to open these
>>>>> dumps because they take a lot of time to download.
>>>>> Please help me with these issues, and let me know where we can get help
>>>>> from mentors in writing the proposal.
>>>>> Thanks.
>>>>>
>>>>>
>>>>> --
>>>>> With Regards,
>>>>> Pratyush Kumar
>>>>> IIT Roorkee
>>>>> Mo: +91-7895395395
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> With Regards,
>>> Pratyush Kumar
>>> IIT Roorkee
>>> Mo: +91-7895395395
>>>
>>
>>
>>
>> --
>> With Regards,
>> Pratyush Kumar
>> IIT Roorkee
>> Mo: +91-7895395395
>>
>
>
>
> --
> With Regards,
> Pratyush Kumar
> IIT Roorkee
> Mo: +91-7895395395
>