Hi Nick,

Thanks for the response! Yes, the idea you mention is what I was planning to try next, but I thought I'd ask here first to see whether anyone had other ideas for handling this with what the library has built in. I will report back here with what I did if it works out!

Also, the file is about half a gigabyte, and it takes ~25 minutes to parse.

Thanks,
Brendan

On Friday, January 7, 2022 at 6:38:43 AM UTC-5 nichol...@surroundaustralia.com wrote:

> Hi Brendan,
>
> This is an interesting issue! No, I haven't encountered it, but then I
> never use large RDF/XML graphs. How large is your graph, by the way?
>
> If you really think the issue is the getting or testing of elements in the
> RDF DefinedNamespace, couldn't you just clone rdfxml.py and replace all
> references to the RDF DefinedNamespace with references to a hard-coded set
> of URIRefs? You could try using that in place of the current rdfxml.py and
> see if there is a speedup. The file's only ~600 lines long, so a find 'n
> replace shouldn't be too hard.
>
> I would love to know how you go with this, if you try it. If it overcomes
> the problem, we may consider doing such a replacement within internal
> RDFlib files to improve performance and then providing the
> DefinedNamespaces for external use only, i.e. when people define RDFlib
> graphs with g.add() and use FOAF.givenName to represent URIs.
>
> Cheers,
>
> Nick
>
> On Thu, Dec 23, 2021 at 3:32 AM Brendan McMahon <brendan...@tempus.com>
> wrote:
>
>> Dear rdflib contributors and maintainers,
>>
>> I have recently been trying to update rdflib to version 6 from 4.2.2.
>> Upon doing so, a process I normally run, which uses rdflib to load a large
>> RDF/XML file into a graph, has a significantly larger memory profile and
>> latency (for my large file, parsing is taking about 1.5x as much time).
>>
>> I've traced the issue back to the graph.parse method. More specifically,
>> by profiling graph.parse with versions 6.1.1 and 4.2.2, I can see that
>> calls to access members of the RDF class (mostly occurring in the
>> node_element_start method here
>> <https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/parsers/rdfxml.py#L299>
>> as well as the property_element_start method) seem to be taking
>> significantly longer, as the class is now a DefinedNamespace with
>> overridden __getitem__ and __contains__
>> <https://github.com/RDFLib/rdflib/blob/2011a6dd85518642e0800b2ee010a5565e16e5cc/rdflib/namespace/__init__.py#L190>
>> methods with added string checks.
>>
>> Has anyone else experienced this issue? I have been trying to find ways
>> to work around/with the library to lower the latency, but haven't been able
>> to find anything yet.
>>
>> Thanks,
>> Brendan
>>
>> ------------------------------
>> This email and any attachments may contain confidential and/or privileged
>> information. If you are not the intended recipient of this message or their
>> agent, or if this message has been addressed to you in error, please
>> immediately alert the sender by reply email and then delete this message
>> and any attachments. If you are not the intended recipient, you are hereby
>> notified that any use, dissemination, copying, or storage of this message
>> or its attachments is strictly prohibited.
>>
>> --
>> http://github.com/RDFLib
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "rdflib-dev" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to rdflib-dev+...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/rdflib-dev/1f616ab0-187b-4af0-aca3-13d436c6dd71n%40googlegroups.com
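
The hard-coding idea discussed in this thread can be illustrated with a toy benchmark. The sketch below is NOT rdflib's DefinedNamespace implementation: `CheckedNamespaceMeta`, `ToyRDF`, and the term list are hypothetical stand-ins that only mimic the general pattern of a class whose member access and membership tests run string checks on every call, compared against constants resolved once at import time. Plain strings stand in for URIRefs so that it runs without rdflib.

```python
import timeit

BASE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TERMS = frozenset({"type", "Description", "about", "resource", "li"})


class CheckedNamespaceMeta(type):
    """Toy metaclass: each lookup runs string checks and builds a new string."""

    def __getattr__(cls, name):
        # Only reached when normal class attribute lookup fails.
        if name.startswith("__"):
            raise AttributeError(name)
        if name not in cls._terms:
            raise AttributeError(f"{name!r} is not a defined term")
        return cls._base + name

    def __contains__(cls, item):
        item = str(item)
        return item.startswith(cls._base) and item[len(cls._base):] in cls._terms


class ToyRDF(metaclass=CheckedNamespaceMeta):
    _base = BASE
    _terms = TERMS


# Hard-coded alternative: resolve every term once, at import time.
RDF_TYPE = ToyRDF.type
HARD_CODED_TERMS = frozenset(BASE + term for term in TERMS)


def dynamic_lookup(uri):
    # Goes through the metaclass hooks on every call.
    return uri == ToyRDF.type or uri in ToyRDF


def constant_lookup(uri):
    # Uses only pre-built constants and a plain frozenset.
    return uri == RDF_TYPE or uri in HARD_CODED_TERMS


if __name__ == "__main__":
    sample = BASE + "Description"
    for fn in (dynamic_lookup, constant_lookup):
        seconds = timeit.timeit(lambda: fn(sample), number=200_000)
        print(f"{fn.__name__}: {seconds:.3f}s")
```

Both functions return the same answers, so the timings isolate the cost of the lookup mechanism. Whether the same gap appears in the real parser would need to be measured against a patched copy of rdfxml.py, as suggested in the thread.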