I don't see why a pluggable zip mechanism should be necessary. Java supports ZIP (JAR) out of the box through the classes in java.util.zip. If the type system is not persisted together with the XMI, then GZIP (built into Java) or BZIP2 (ships with Apache Ant) would be OK as well, since a single compressed stream is then enough and no multi-entry archive is needed. Given that a reader cannot change the type system of a CAS, carrying a serialized type system with each XMI is questionable.
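A minimal sketch of the built-in GZIP support mentioned above, assuming the XMI is available as a plain byte stream. The class name GzipXmiSketch, its two helper methods, and the sample XML string are made up for illustration and are not part of UIMA or the CAS Editor:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: round-trips bytes through java.util.zip's GZIP streams.
class GzipXmiSketch {

    // Compress arbitrary bytes (for example, serialized XMI) with GZIP.
    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the GZIP trailer.
        try (OutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        }
        return buffer.toByteArray();
    }

    // Decompress GZIP data back into the original bytes.
    static byte[] decompress(byte[] packed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for real XMI content (assumption, not actual CAS output).
        byte[] xmi = "<xmi:XMI xmlns:xmi=\"http://www.omg.org/XMI\"/>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] restored = decompress(compress(xmi));
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```

For .xmi.gz files, the same two stream classes would presumably wrap a FileOutputStream or FileInputStream instead of the in-memory buffers used here.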
Cheers,

Richard

On 15.07.2011 at 04:32, Marshall Schor wrote:

> there seem to be lots of zip implementations - another one for instance is
> 7-zip. I haven't studied this issue enough to have a real opinion, but if zips
> are implemented, I wonder if it would be good to have some kind of a pluggable
> mechanism to allow for different zips for different circumstances.
>
> -Marshall
>
> On 7/14/2011 4:32 PM, Greg Holmberg (JIRA) wrote:
>> [ https://issues.apache.org/jira/browse/UIMA-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065516#comment-13065516 ]
>>
>> Greg Holmberg commented on UIMA-2128:
>> -------------------------------------
>>
>> I'm not sure where in the code it should be implemented (which seems to be
>> Jörn's point), but another technology option is EXI, a binary encoding for
>> XML. See http://www.w3.org/XML/EXI
>>
>> I've experimented with this to encode XMI between a UIMA process and a
>> database on separate machine (both ends Java). I used a commercial
>> implementation, Efficient XML from AgileDelta, and did some throughput
>> measurements, comparing it to gzipped XMI XML. I found that EXI produces
>> somewhat smaller data than gzipped XML text (maybe 10% or 20% smaller, if I
>> remember correctly). The biggest benefit to EXI was the amount of CPU time
>> required to read and write. It was quite a bit faster than gzip to generate
>> the XML and parse the XML. This is probably because it does so directly
>> from the ContentHandler, whereas with gzipped text, you first have to write
>> the text and then compress it. Also, I suppose it's just more efficient to
>> parse a binary format than to step through characters looking for certain
>> tokens.
>>
>> In my case, the improved throughput and reduction of CPU usage was most
>> important on the receiving end (i.e. the database) since it is a central
>> bottle-neck in the overall landscape of my system. As the number of UIMA
>> senders increases, the database reaches its limits to handle more messages
>> (no more CPU or NIC capacity on that machine). So it was important to me to
>> be as efficient as possible with the XMI parsing on that machine in order to
>> minimize my hardware costs.
>>
>> In my case, all annotators are local, but a similar bottle-neck situation
>> could arise in UIMA AS if you have a remote annotator (service). Then,
>> making that UIMA processor as efficient as possible in terms of both CPU and
>> network bandwidth usage becomes important. GZip will help a lot compared to
>> plain text, but EXI is even better, especially to reduce the CPU usage of
>> XML generating and parsing, but also somewhat on the network bandwidth.
>>
>> Some open-source implementations of EXI are listed here:
>> http://en.wikipedia.org/wiki/Efficient_XML_Interchange
>>
>> I also tried the open-source Java implementation, EXIficient. In early 2010
>> it implemented the standard technically correctly, but was immature, slow
>> (really, really slow!), and used a lot of memory. However, it's been a year
>> since, so maybe it's improved since then. I talked to the developers (from
>> Siemens) about their use of the GPL license, and they were not interested in
>> changing to an Apache-compatible license, so that may be an issue for use in
>> UIMA.
>>
>> I have not tried the other open-source Java implementation, OpenEXI. This
>> uses the Apache license though. There's some discussion of gzipped XML text
>> versus EXI here: http://openexi.sourceforge.net/#whynotgzip
>>
>> There's also an open-source C implementation, called EXIP. I don't know
>> anything about it.
>>
>>
>>> Support to for gzipped XMI files
>>> --------------------------------
>>>
>>> Key: UIMA-2128
>>> URL: https://issues.apache.org/jira/browse/UIMA-2128
>>> Project: UIMA
>>> Issue Type: Wish
>>> Components: CasEditor
>>> Reporter: Richard Eckart de Castilho
>>>
>>> Since XMI files tend to grow rather rapidly, it would be great if the CAS
>>> Editor supported to read and write gzipped XMI files (.xmi.gz).
>> --
>> This message is automatically generated by JIRA.
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>
>>

Richard Eckart de Castilho

--
-------------------------------------------------------------------
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
[email protected]
www.ukp.tu-darmstadt.de
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------
