Hello guys, I am trying to create a graph that models the relations between the pages of a website. I have a service that crawls the website, and then I create the relations between the pages. My current design is the following.
I create a node with the label JOB that holds the job_id. Then I create a node with the labels STARTING_URL and URL that holds the URL I started crawling from. Next, I keep stitching the graph together: whenever I visit a page, I create a VISITED relationship between the source node and the target node, both of which have the label URL.

Now I have some problems. If I have two different jobs for website A (in other words, website A has been crawled twice), the job id will be unique, but the STARTING_URL node and the other URL nodes are only updated (each node has an array attribute containing the ids of the jobs it has been part of), which is OK so far. Then I create the relationships between the source and target nodes as before. There are already VISITED relationships between them, but I create them again. Each relationship also has a job_id attribute (I will use it later to filter relationships).

When I query to see the graph of job_id = 2807, my query is something like:

    MATCH p=(j:JOB)-[r:HAS]->(s:STARTING_URL)-[r1:VISITED]->(t:URL) WHERE j.job_id=2807 RETURN p

When I run this query, I keep getting the multiple relationships between the (s:STARTING_URL) and (t:URL) nodes, and the graph becomes something like this:
<https://lh3.googleusercontent.com/-snLMzoqqKD8/VWhhyD7fHTI/AAAAAAAAA3M/gIla3F_sYV0/s1600/Screen%2BShot%2B2015-05-29%2Bat%2B3.55.15%2BPM.png>

You can see the multiple relationships. I don't want to see them. I can filter on the VISITED relationship r1 by adding the condition r1.job_id = 2807, and that solves the problem, but is this the best solution? And how can I force all paths to start from my job_id, so that I have a unique graph for each job_id?

I also have this problem. If I start a job from "http://www.w3.org" with job_id=1, this node will have the property job_id=1 and the labels STARTING_URL and URL.
I will probably have this graph:
<https://lh3.googleusercontent.com/-DCDuGLFQ5NE/VWhjxVZ0NYI/AAAAAAAAA3U/NSNy7a_Kq_c/s1600/Screen%2BShot%2B2015-05-29%2Bat%2B4.03.51%2BPM.png>

If I start another job from the URL http://w3.org, that node will have job_id=2 and the labels STARTING_URL and URL, and I will get this graph:
<https://lh3.googleusercontent.com/-pX94zt2FPuc/VWhkHpSATYI/AAAAAAAAA3c/FL-ppFv62mM/s1600/Screen%2BShot%2B2015-05-29%2Bat%2B4.05.18%2BPM.png>

In one of my queries, I want to know how many nodes I have at each level. To measure this for job_id=2, I use the following query:

    MATCH p=(j:JOB {job_id:2})-[r:HAS]->(s:STARTING_URL)-[r1:VISITED]->(t:URL) RETURN count(t) // query to get number of nodes in level 1

It should tell me that level 1 has only one node, because I started from w3.org and was redirected to www.w3.org. However, job_id 1 started from www.w3.org, so that node has the STARTING_URL label too. As a result, the query matches the relationship w3.org --> www.w3.org and also all relationships www.w3.org --> (other nodes). The level values I get are [1, 21, 20], which is NOT correct. I should get [1, 1, 20].

My question is basically: what do you think about this design of the graph, in terms of performance and of correctness with respect to real websites? Any ideas to enhance the graph?

Regards,
Ibrahim
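
For reference, the relationship filter mentioned earlier could be extended to the per-level counts roughly as follows. This is only a sketch, not a tested solution: it assumes every VISITED relationship carries a job_id property as described above, the depth bound of 3 is an arbitrary choice, and "level" is taken to mean path length from the starting URL, so a page reachable by paths of different lengths is counted once at each of those levels.

```cypher
// Sketch: count pages per level for a single job, following only the
// VISITED relationships that belong to that job.
// Assumptions: VISITED.job_id exists on every relationship; depth 1..3 is arbitrary.
MATCH (j:JOB {job_id: 2})-[:HAS]->(s:STARTING_URL)
MATCH p = (s)-[:VISITED*1..3]->(t:URL)
WHERE ALL(rel IN relationships(p) WHERE rel.job_id = 2)
RETURN length(p) AS level, count(DISTINCT t) AS pages
ORDER BY level
```

Anchoring the pattern on the JOB node and checking job_id on every relationship in the path is meant to keep relationships created by other jobs out of the result, even when a URL node is shared between jobs.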
