Hi,

There is an Apache project for that: Nutch (https://nutch.apache.org/).
You can write a plugin for Gora (https://gora.apache.org/) that saves the data into Neo4j.

Cheers

On 20/11/2014 01:46, Michael Hunger wrote:
Probably not such a good idea, because you want to run the crawler multi-threaded across a lot of network connections, and that would affect Neo4j's performance (also in terms of GC).

It is probably easier to use a message queue to send crawled pages to a Neo4j extension, and then let the extension run the graph algorithms you want to use to best integrate the crawling results into your graph.
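
To make the suggestion above a bit more concrete, here is a minimal sketch of the decoupling, not a real Neo4j setup. Everything in it is an assumption for illustration: `GraphStore` is a hypothetical in-memory stand-in for the extension side, `fake_fetch` is a hypothetical fetcher that returns a fixed link structure instead of touching the network, and Python's `queue.Queue` stands in for an actual message queue. Crawler threads only push results onto the queue; a single consumer integrates them and can answer the "already explored" and "frontier" questions.

```python
import queue
import threading


class GraphStore:
    """Hypothetical in-memory stand-in for the graph kept on the Neo4j side."""

    def __init__(self):
        self.links = {}        # url -> set of outgoing urls
        self.explored = set()  # pages that have already been crawled

    def integrate(self, url, outgoing):
        # Record the crawled page and the links found on it.
        self.explored.add(url)
        self.links.setdefault(url, set()).update(outgoing)

    def frontier(self):
        # Pages that some crawled page links to but that are not yet explored.
        discovered = set()
        for targets in self.links.values():
            discovered |= targets
        return discovered - self.explored


def fake_fetch(url):
    # Hypothetical fetcher: a fixed link structure instead of network I/O.
    site = {"a": ["b", "c"], "b": ["c"], "c": ["d"]}
    return site.get(url, [])


def crawl_worker(urls, out_queue):
    # Crawler side: fetch pages and hand the results to the queue only.
    for url in urls:
        out_queue.put((url, fake_fetch(url)))


def consume(in_queue, store):
    # Consumer side: the only code that touches the graph store.
    while True:
        item = in_queue.get()
        if item is None:  # sentinel: no more crawl results
            break
        url, outgoing = item
        store.integrate(url, outgoing)


def run_demo():
    q = queue.Queue()
    store = GraphStore()
    consumer = threading.Thread(target=consume, args=(q, store))
    consumer.start()
    workers = [
        threading.Thread(target=crawl_worker, args=(batch, q))
        for batch in (["a"], ["b", "c"])
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    q.put(None)  # all crawlers are done, tell the consumer to stop
    consumer.join()
    return store
```

The point of the split is that the crawler threads and their network connections live outside the database process, while the graph side only sees a stream of (page, links) messages that it can integrate at its own pace.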

HTH Michael

On Wed, Nov 19, 2014 at 6:45 PM, Pedro Montoto García <[email protected]> wrote:

    While looking into implementing a domain-specific web crawler, I've
    come across a number of technologies, but I had the idea of
    implementing it as a server extension in Neo4j.

    The idea would be to use the graph database to implement the
    concepts of "already explored pages" and "frontier" as server-side
    algorithms and use them to feed the crawling algorithm. But, as you
    can see, you could go a step further and implement the crawling on
    the server side too. Could this be a bad idea? If so, why?
-- You received this message because you are subscribed to the Google
    Groups "Neo4j" group.
    To unsubscribe from this group and stop receiving emails from it,
    send an email to [email protected]
    <mailto:[email protected]>.
    For more options, visit https://groups.google.com/d/optout.

