Re: [jira] [Commented] (UIMA-2128) Support to for gzipped XMI files

Richard Eckart de Castilho Fri, 15 Jul 2011 08:45:57 -0700

I don't see why a pluggable zip mechanism should be necessary. Java supports 
ZIP (JAR) out of the box through the classes in java.util.zip. If the type 
system is not persisted together with the XMI, then GZIP (also in the JDK) or 
BZIP2 (ships with Apache Ant) would be fine as well. Given that a reader cannot 
change the type system of a CAS, carrying a serialized type system with each 
XMI is questionable.
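
As a minimal sketch of the GZIP option, assuming nothing beyond the JDK: the 
helper class below (GzipRoundTrip is a made-up name, not part of UIMA) 
compresses and restores a byte array with java.util.zip. The hard-coded string 
only stands in for a real XMI document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of UIMA: GZIP round trip using only the JDK.
final class GzipRoundTrip {

    // Compresses the given bytes with the JDK's built-in GZIP support.
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(buffer)) {
            out.write(plain);
        } // closing the GZIP stream flushes the data and writes the trailer
        return buffer.toByteArray();
    }

    // Decompresses GZIP data back into the original bytes.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in payload (assumption); real content would come from the XMI serializer.
        String xmi = "<xmi:XMI xmlns:xmi=\"http://www.omg.org/XMI\"/>";
        byte[] packed = gzip(xmi.getBytes(StandardCharsets.UTF_8));
        String restored = new String(gunzip(packed), StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + restored.equals(xmi));
    }
}
```

Since both classes are plain stream wrappers, the same idea should apply to 
any code that already reads or writes XMI through an InputStream or 
OutputStream.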


Cheers,

Richard

On 15.07.2011 at 04:32, Marshall Schor wrote:

> There seem to be lots of zip implementations; another one, for instance, is
> 7-zip.  I haven't studied this issue enough to have a real opinion, but if
> zips are implemented, I wonder if it would be good to have some kind of
> pluggable mechanism to allow for different zips for different circumstances.
> 
> -Marshall
> 
> On 7/14/2011 4:32 PM, Greg Holmberg (JIRA) wrote:
>>    [ 
>> https://issues.apache.org/jira/browse/UIMA-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065516#comment-13065516
>>  ] 
>> 
>> Greg Holmberg commented on UIMA-2128:
>> -------------------------------------
>> 
>> I'm not sure where in the code it should be implemented (which seems to be 
>> Jörn's point), but another technology option is EXI, a binary encoding for 
>> XML.  See http://www.w3.org/XML/EXI 
>> 
>> I've experimented with this to encode XMI between a UIMA process and a 
>> database on a separate machine (both ends Java).  I used a commercial 
>> implementation, Efficient XML from AgileDelta, and did some throughput 
>> measurements, comparing it to gzipped XMI XML.  I found that EXI produces 
>> somewhat smaller data than gzipped XML text (maybe 10% or 20% smaller, if I 
>> remember correctly).  The biggest benefit to EXI was the amount of CPU time 
>> required to read and write.  It was quite a bit faster than gzip to generate 
>> the XML and parse the XML.  This is probably because it does so directly 
>> from the ContentHandler, whereas with gzipped text, you first have to write 
>> the text and then compress it.  Also, I suppose it's just more efficient to 
>> parse a binary format than to step through characters looking for certain 
>> tokens.
>> 
>> In my case, the improved throughput and reduced CPU usage were most 
>> important on the receiving end (i.e. the database), since it is a central 
>> bottleneck in the overall landscape of my system.  As the number of UIMA 
>> senders increases, the database reaches its limit for handling more messages 
>> (no more CPU or NIC capacity on that machine).  So it was important to me to 
>> be as efficient as possible with the XMI parsing on that machine in order to 
>> minimize my hardware costs.
>> 
>> In my case, all annotators are local, but a similar bottleneck situation 
>> could arise in UIMA AS if you have a remote annotator (service).  Then, 
>> making that UIMA processor as efficient as possible in terms of both CPU and 
>> network bandwidth usage becomes important.  GZip will help a lot compared to 
>> plain text, but EXI is even better: it mainly reduces the CPU usage of XML 
>> generation and parsing, and it also saves some network bandwidth.
>> 
>> Some open-source implementations of EXI are listed here: 
>> http://en.wikipedia.org/wiki/Efficient_XML_Interchange
>> 
>> I also tried the open-source Java implementation, EXIficient.  In early 2010 
>> it implemented the standard technically correctly, but was immature, slow 
>> (really, really slow!), and used a lot of memory.  However, that was about a 
>> year ago, so it may have improved since then.  I talked to the developers (from 
>> Siemens) about their use of the GPL license, and they were not interested in 
>> changing to an Apache-compatible license, so that may be an issue for use in 
>> UIMA.
>> 
>> I have not tried the other open-source Java implementation, OpenEXI, which 
>> uses the Apache license.  There's some discussion of gzipped XML text 
>> versus EXI here: http://openexi.sourceforge.net/#whynotgzip 
>> 
>> There's also an open-source C implementation, called EXIP.  I don't know 
>> anything about it.
>> 
>> 
>>> Support to for gzipped XMI files
>>> --------------------------------
>>> 
>>>                Key: UIMA-2128
>>>                URL: https://issues.apache.org/jira/browse/UIMA-2128
>>>            Project: UIMA
>>>         Issue Type: Wish
>>>         Components: CasEditor
>>>           Reporter: Richard Eckart de Castilho
>>> 
>>> Since XMI files tend to grow rather rapidly, it would be great if the CAS 
>>> Editor supported reading and writing gzipped XMI files (.xmi.gz).
>> --
>> This message is automatically generated by JIRA.
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>> 
>> 
>> 
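
For the original wish quoted above (reading and writing .xmi.gz), one possible 
shape is to pick the stream wrapper from the file name. This is only a sketch 
under stated assumptions: XmiStreams and its methods are hypothetical names, 
not existing CAS Editor or UIMA classes, and detecting compression purely by a 
".gz" suffix is an assumption rather than anything the issue specifies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not an existing CAS Editor or UIMA class.
final class XmiStreams {

    // Assumption: compressed files are recognized purely by a ".gz" suffix.
    static boolean isGzipped(Path file) {
        return file.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".gz");
    }

    // Opens the file for reading, transparently decompressing *.gz files.
    static InputStream openForRead(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        if (!isGzipped(file)) {
            return raw;
        }
        try {
            return new GZIPInputStream(raw);
        } catch (IOException e) {
            raw.close(); // do not leak the file handle if the GZIP header is invalid
            throw e;
        }
    }

    // Opens the file for writing, transparently compressing *.gz files.
    static OutputStream openForWrite(Path file) throws IOException {
        OutputStream raw = Files.newOutputStream(file);
        if (!isGzipped(file)) {
            return raw;
        }
        try {
            return new GZIPOutputStream(raw);
        } catch (IOException e) {
            raw.close(); // writing the GZIP header failed
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("example", ".xmi.gz");
        try (OutputStream out = openForWrite(file)) {
            // Stand-in payload (assumption), not a real serialized CAS.
            out.write("<xmi:XMI/>".getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(openForRead(file), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
        Files.delete(file);
    }
}
```

Callers would then hand the returned stream to whatever already reads or 
writes plain XMI, so uncompressed .xmi files keep working unchanged.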

Richard Eckart de Castilho

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
[email protected] 
www.ukp.tu-darmstadt.de 
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------
