There seem to be lots of zip implementations; another one, for instance, is 7-Zip. I haven't studied this issue enough to have a real opinion, but if zip support is implemented, I wonder whether it would be good to have some kind of pluggable mechanism that allows different zips for different circumstances.
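To make the pluggable idea a bit more concrete, here is a minimal Java sketch of what such a mechanism might look like. Everything in it is hypothetical: `StreamCodec`, `CodecRegistry`, and the choice of picking a codec by file-name suffix are my own illustration and not existing UIMA APIs. Only gzip from the JDK is registered; another implementation would be added with a further `register` call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical plug-in point (not a UIMA API): wraps raw streams with a codec. */
interface StreamCodec {
    OutputStream wrapOutput(OutputStream raw) throws IOException;
    InputStream wrapInput(InputStream raw) throws IOException;
}

/** Hypothetical registry that selects a codec from the file-name suffix. */
class CodecRegistry {
    /** Used when no suffix matches: streams pass through unchanged. */
    private static final StreamCodec IDENTITY = new StreamCodec() {
        public OutputStream wrapOutput(OutputStream raw) { return raw; }
        public InputStream wrapInput(InputStream raw) { return raw; }
    };

    private final Map<String, StreamCodec> bySuffix = new LinkedHashMap<>();

    void register(String suffix, StreamCodec codec) {
        bySuffix.put(suffix, codec);
    }

    /** Returns the first registered codec whose suffix matches, else identity. */
    StreamCodec forFileName(String fileName) {
        for (Map.Entry<String, StreamCodec> entry : bySuffix.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return IDENTITY;
    }

    /** Registry with only the JDK's gzip support plugged in. */
    static CodecRegistry withDefaults() {
        CodecRegistry registry = new CodecRegistry();
        registry.register(".gz", new StreamCodec() {
            public OutputStream wrapOutput(OutputStream raw) throws IOException {
                return new GZIPOutputStream(raw);
            }
            public InputStream wrapInput(InputStream raw) throws IOException {
                return new GZIPInputStream(raw);
            }
        });
        return registry;
    }

    /** Encodes a byte array in memory with the given codec. */
    static byte[] encode(StreamCodec codec, byte[] plain) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = codec.wrapOutput(sink)) {
            out.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decodes a byte array in memory with the given codec. */
    static byte[] decode(StreamCodec codec, byte[] encoded) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = codec.wrapInput(new ByteArrayInputStream(encoded))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sink.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

A caller would ask the registry for `forFileName("doc.xmi.gz")` and wrap its file streams with the result, so the reading and writing code would not need to know which zip is in use.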
-Marshall

On 7/14/2011 4:32 PM, Greg Holmberg (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/UIMA-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065516#comment-13065516 ]
>
> Greg Holmberg commented on UIMA-2128:
> -------------------------------------
>
> I'm not sure where in the code it should be implemented (which seems to be
> Jörn's point), but another technology option is EXI, a binary encoding for
> XML. See http://www.w3.org/XML/EXI
>
> I've experimented with this to encode XMI between a UIMA process and a
> database on a separate machine (both ends Java). I used a commercial
> implementation, Efficient XML from AgileDelta, and did some throughput
> measurements, comparing it to gzipped XMI XML. I found that EXI produces
> somewhat smaller data than gzipped XML text (maybe 10% or 20% smaller, if I
> remember correctly). The biggest benefit of EXI was the amount of CPU time
> required to read and write. It was quite a bit faster than gzip to generate
> the XML and parse the XML. This is probably because it does so directly from
> the ContentHandler, whereas with gzipped text, you first have to write the
> text and then compress it. Also, I suppose it's just more efficient to parse
> a binary format than to step through characters looking for certain tokens.
>
> In my case, the improved throughput and reduction of CPU usage were most
> important on the receiving end (i.e. the database), since it is a central
> bottleneck in the overall landscape of my system. As the number of UIMA
> senders increases, the database reaches its limit for handling more messages
> (no more CPU or NIC capacity on that machine). So it was important to me to
> be as efficient as possible with the XMI parsing on that machine in order to
> minimize my hardware costs.
>
> In my case, all annotators are local, but a similar bottleneck situation
> could arise in UIMA AS if you have a remote annotator (service). Then,
> making that UIMA processor as efficient as possible in terms of both CPU and
> network bandwidth usage becomes important. GZip will help a lot compared to
> plain text, but EXI is even better, especially for reducing the CPU usage of
> XML generation and parsing, but also somewhat for network bandwidth.
>
> Some open-source implementations of EXI are listed here:
> http://en.wikipedia.org/wiki/Efficient_XML_Interchange
>
> I also tried the open-source Java implementation, EXIficient. In early 2010
> it implemented the standard technically correctly, but was immature, slow
> (really, really slow!), and used a lot of memory. However, it's been a year
> since, so maybe it has improved. I talked to the developers (from
> Siemens) about their use of the GPL license, and they were not interested in
> changing to an Apache-compatible license, so that may be an issue for use in
> UIMA.
>
> I have not tried the other open-source Java implementation, OpenEXI. This
> uses the Apache license though. There's some discussion of gzipped XML text
> versus EXI here: http://openexi.sourceforge.net/#whynotgzip
>
> There's also an open-source C implementation, called EXIP. I don't know
> anything about it.
>
>
>> Support to for gzipped XMI files
>> --------------------------------
>>
>> Key: UIMA-2128
>> URL: https://issues.apache.org/jira/browse/UIMA-2128
>> Project: UIMA
>> Issue Type: Wish
>> Components: CasEditor
>> Reporter: Richard Eckart de Castilho
>>
>> Since XMI files tend to grow rather rapidly, it would be great if the CAS
>> Editor supported reading and writing gzipped XMI files (.xmi.gz).
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>
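For anyone who wants a rough feel for the gzip side of the comparison quoted above, the following Java sketch gzips a synthetic, repetitive XML document in memory and reads it back, using only `java.util.zip` from the JDK. The class name `GzipXmiSketch` and the sample document are made up for illustration and are not real XMI; no EXI implementation is involved, and the sketch makes no claim about the EXI figures in the quoted comment. In UIMA itself, I believe the same `GZIPOutputStream` could be handed to `XmiCasSerializer.serialize(cas, out)`, but that call is not exercised here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative only: gzip round trip for a made-up, XMI-like XML document. */
class GzipXmiSketch {

    /** Builds a repetitive XML document; a stand-in for real XMI output. */
    static byte[] sampleXml(int annotationCount) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<doc>\n");
        for (int i = 0; i < annotationCount; i++) {
            xml.append("  <annotation id=\"").append(i)
               .append("\" begin=\"").append(i * 5)
               .append("\" end=\"").append(i * 5 + 4)
               .append("\"/>\n");
        }
        xml.append("</doc>\n");
        return xml.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Compresses the bytes with gzip, as one would when writing .xmi.gz. */
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(sink)) {
            out.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decompresses gzip data, as one would when reading .xmi.gz. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sink.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = sampleXml(1000);
        byte[] packed = gzip(plain);
        System.out.println("plain=" + plain.length + " bytes, gzipped=" + packed.length + " bytes");
    }
}
```

The sizes printed depend on the input, so real XMI files would need to be measured separately; timing the `gzip` and `gunzip` calls on such files would be the natural next step for comparing CPU cost.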
