Hi,

> It's in GraphML format and over 900MB in size. I let it run overnight
> and it's still not done. The file contains email content that I don't
> need - I'm really just after who sent an email to whom. Is there any
> way to just read this in, and ignore the rest, that might be faster?

I would preprocess the GraphML file first; in particular, remove the subtrees that sit inside a <data key="body">...</data> section. Since GraphML is just plain XML, your best bet is probably a command-line XML manipulation tool. I was told that XMLStarlet (http://xmlstar.sourceforge.net/download.php) is quite good at such manipulations; I haven't used it personally, but a quick glance at its documentation suggests that you can probably achieve your goal with:

    xmlstarlet ed -N ns=http://graphml.graphdrawing.org/xmlns -d "//ns:data[@key='body']" input.graphml

(The above command line may not be entirely correct, but the idea is to select all the "data" elements whose "key" attribute equals "body" and delete them. The -N option declares the XML namespace in which the data element is to be found.)

Note that the file downloaded from infochimps seems to have some metadata at the front; I had to skip the first 1024 bytes to get to the first XML tag.

T.

_______________________________________________
igraph-help mailing list
https://lists.nongnu.org/mailman/listinfo/igraph-help
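
The same idea (drop every "data" element whose "key" is "body", after skipping the leading metadata) can also be sketched as a streaming filter in Python, which never holds the whole document in memory. This is only a minimal sketch under stated assumptions, not a tested tool for this particular file: the names DropDataKey, strip_data_key and skip_bytes are made up for illustration, the "data" elements are assumed to be written without a namespace prefix, and the 1024-byte offset is simply the value observed above.

```python
from xml.sax import make_parser
from xml.sax.saxutils import XMLGenerator


class DropDataKey(XMLGenerator):
    """Copy an XML stream, omitting <data key="..."> elements with one key.

    Hypothetical helper: assumes <data> is written without a namespace prefix.
    """

    def __init__(self, out, key="body"):
        super().__init__(out, encoding="utf-8")
        self._key = key
        self._skip_depth = 0  # > 0 while inside an element being dropped

    def startElement(self, name, attrs):
        if self._skip_depth:
            # Nested element inside a dropped subtree: track depth only.
            self._skip_depth += 1
        elif name == "data" and attrs.get("key") == self._key:
            self._skip_depth = 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self._skip_depth:
            self._skip_depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self._skip_depth:
            super().characters(content)


def strip_data_key(src, dst, key="body", skip_bytes=0):
    """Stream XML from binary file `src` to text file `dst`, dropping `key`.

    `skip_bytes` is the number of leading non-XML bytes to skip (assumed
    to be known in advance, e.g. 1024 for the file discussed above).
    """
    src.seek(skip_bytes)
    parser = make_parser()
    parser.setContentHandler(DropDataKey(dst, key))
    parser.parse(src)
```

It could be called along the lines of strip_data_key(open("input.graphml", "rb"), open("stripped.graphml", "w", encoding="utf-8"), skip_bytes=1024), where both file names are placeholders. XML comments are not copied to the output, because a plain SAX content handler never sees them.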
