Hello guys,
I am trying to create a graph that contains the relationships between the 
pages of a website. I have a service that crawls the website, and then I 
create the relationships between the pages. My current design is as follows.

I create a node with the label JOB that holds the job_id. Then I create a 
node with the labels STARTING_URL and URL that holds the URL I started 
crawling from. Next, I keep stitching the graph together: whenever I visit 
a page, I create a VISITED relationship between the source node and the 
target node, both of which have the label URL.
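
To make the design concrete, here is a rough sketch of the writes for one 
job and one visited page. The property name `url` and the target URL are 
assumptions made up for illustration; the labels, the HAS and VISITED 
relationship types and the job_id property are the ones I actually use.

```cypher
// Sketch only: `url` and the target page are illustrative assumptions.
MERGE (j:JOB {job_id: 2807})
MERGE (s:URL {url: 'http://www.w3.org'})
SET s:STARTING_URL                       // the starting page carries both labels
MERGE (j)-[:HAS]->(s)
MERGE (t:URL {url: 'http://www.w3.org/standards/'})
// One VISITED relationship per job, tagged with the job that created it.
MERGE (s)-[:VISITED {job_id: 2807}]->(t)
```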

Now I have some problems.
If I have two different jobs for website A (in other words, website A has 
been crawled twice), the job id will be unique, but the STARTING_URL node 
and the other URL nodes will be updated (each node has an array attribute 
containing the ids of the jobs it has been part of), which is OK so far. 
Then I create the relationships between source and target nodes as before. 
There are already VISITED relationships between them, but I create them 
again. Each relationship also has a job_id attribute (I will use it later 
to filter relationships).

When I query to see the graph of job_id = 2807, my query is something like 

MATCH p=(j:JOB)-[r:HAS]->(s:STARTING_URL)-[r1:VISITED]->(t:URL)
WHERE j.job_id=2807
RETURN p

When I run this query I keep getting multiple relationships between the 
(s:STARTING_URL) and (t:URL) nodes, and the graph becomes something like 
this:

<https://lh3.googleusercontent.com/-snLMzoqqKD8/VWhhyD7fHTI/AAAAAAAAA3M/gIla3F_sYV0/s1600/Screen%2BShot%2B2015-05-29%2Bat%2B3.55.15%2BPM.png>

You can see the multiple relationships. I don't want to see them. I can 
filter on the VISITED relationship by adding a condition 
(WHERE r1.job_id = 2807), and that solves the problem, but is this the best 
solution? And how can I force all paths to start from my job_id, so that I 
get a unique graph for each job_id?
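
For reference, this is roughly what the filtered version looks like. It is 
only a sketch of the workaround I described: the job id is bound once with 
WITH so that the JOB node and the VISITED relationship (r1) are always 
matched against the same value.

```cypher
// Sketch only: bind the job id once and reuse it for node and relationship.
WITH 2807 AS jobId
MATCH p = (j:JOB {job_id: jobId})-[:HAS]->(s:STARTING_URL)
          -[r1:VISITED {job_id: jobId}]->(t:URL)
RETURN p
```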

I also have this problem. If I start a job from "http://www.w3.org" with 
job_id=1, this node will have the property job_id=1 and the labels 
STARTING_URL and URL. 
I will probably get this graph:

<https://lh3.googleusercontent.com/-DCDuGLFQ5NE/VWhjxVZ0NYI/AAAAAAAAA3U/NSNy7a_Kq_c/s1600/Screen%2BShot%2B2015-05-29%2Bat%2B4.03.51%2BPM.png>
 

If I start another job from this URL (http://w3.org), the node will have 
job_id=2 and the labels STARTING_URL and URL, and I will get this graph:

<https://lh3.googleusercontent.com/-pX94zt2FPuc/VWhkHpSATYI/AAAAAAAAA3c/FL-ppFv62mM/s1600/Screen%2BShot%2B2015-05-29%2Bat%2B4.05.18%2BPM.png>

In one of my queries, I want to know how many nodes I have at each level. 
To measure this for job_id=2, I use the following query:

MATCH p=(j:JOB {job_id:2})-[r:HAS]->(s:STARTING_URL)-[r1:VISITED]->(t:URL) 
RETURN count(t) // query to get the number of nodes in level 1

It should tell me that level 1 has only one node, because I started from 
w3.org and was redirected to www.w3.org. However, job_id 1 started from 
www.w3.org, so that node has the STARTING_URL label too. As a result, the 
query matches the relationship w3.org --> www.w3.org and also all the 
relationships www.w3.org --> (other nodes). The level values I get are 
[1, 21, 20], which is NOT correct. I should get [1, 1, 20].
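
One idea I am considering for the level counts is to anchor on the JOB node 
and only follow VISITED relationships that carry the same job_id, instead 
of relying on the STARTING_URL label. A rough, untested sketch, where the 
depth bound of 3 is an arbitrary assumption:

```cypher
// Sketch only: count the URL nodes per level for a single job.
WITH 2 AS jobId
MATCH (j:JOB {job_id: jobId})-[:HAS]->(s:STARTING_URL)
MATCH p = (s)-[:VISITED*1..3]->(t:URL)
// Keep only paths whose relationships all belong to this job.
WHERE all(rel IN relationships(p) WHERE rel.job_id = jobId)
// A page reachable at several depths is counted at its shallowest level.
WITH t, min(length(p)) AS level
RETURN level, count(t) AS nodes
ORDER BY level
```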


My question is basically: what do you think about this graph design in 
terms of performance and correctness with respect to real web sites? Any 
ideas for improving the graph?

Regards 
Ibrahim 
