On 24.05.2017 04:27, B wrote:
> The problem I'm having is that the pandas library can't even finish reading
> my CSV file, even after 3 hours, on the largest AWS instance.
> 
> I know this isn't related to pandas, but I was wondering if there was a
> better way to prepare the data to send over to graph-tool that could work
> at my scale?  Is there anything I can do to read the CSV file, calculate
> PageRank, and output all the domain vertices with their respective
> PageRank scores?
> 
> I'm scared my only option is using Spark in some distributed fashion... If
> that is the case, how do I still get the edge data as integers into
> graph-tool anyway?


I think the simplest approach is to drop pandas completely and work
with the file directly. You should avoid loading the entire file into
memory, and instead use the iterator interface of Python's csv module:
as the edges are read, you process them and add them to the graph one
by one.
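
For example, here is a minimal sketch of that loop (the file name and
the assumption that the first two columns hold the source and target
domains are mine):

    import csv

    from graph_tool.all import Graph

    def edges(path):
        # Stream (source, target) pairs one row at a time, so the
        # file is never held in memory.
        with open(path, newline="") as f:
            for row in csv.reader(f):
                yield row[0], row[1]

    g = Graph(directed=True)

    # add_edge_list() accepts an arbitrary iterator; hashed=True maps
    # the string domain names to integer vertex indices and returns
    # the names as a vertex property map.
    names = g.add_edge_list(edges("edges.csv"), hashed=True)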

If you look closely, you will see that graph-tool already provides a
"load_graph_from_csv" function:


https://graph-tool.skewed.de/static/doc/graph_tool.html#graph_tool.load_graph_from_csv

This automates the process and performs some basic preprocessing, such
as hashing the vertex names. To normalize the domains, you can wrap
the file in an intermediary iterator that converts each line to lower
case.
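
For example (a sketch only; it assumes the function accepts the
iterator directly, which is handed internally to Python's csv reader,
and "edges.csv" is again a placeholder):

    from graph_tool.all import load_graph_from_csv, pagerank

    def lowercased(path):
        # Yield each line in lower case, so the vertex names are
        # normalized before they are hashed.
        with open(path, newline="") as f:
            for line in f:
                yield line.lower()

    # directed=True assumes the links are directed domain-to-domain
    # edges, which is what PageRank expects.
    g = load_graph_from_csv(lowercased("edges.csv"), directed=True,
                            hashed=True)

    # The hashed vertex names are stored in the internal "name"
    # property map, so computing and printing the PageRank scores,
    # as asked above, is then direct:
    pr = pagerank(g)
    for v in g.vertices():
        print(g.vp.name[v], pr[v])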

Now, for a file of size 600 GB, this will still be quite slow. Maybe
you should take a look at some of the fast CSV parsers out there, e.g.:

    http://www.wise.io/tech/paratext

Best,
Tiago

-- 
Tiago de Paula Peixoto <[email protected]>

