Ilian,

Yes, we have a workaround, so it's not a critical problem, just a feature I was surprised to see unimplemented.  Increasing the size of the database to retain the original formatting is an acceptable tradeoff for us.

Thanks, Jay

On 9/14/06, Ilian Kitchukov <[EMAIL PROTECTED]> wrote:
Hi Jay,

I am sorry we are postponing solving your issue. I had a look at our code
and we can provide an option to preserve the original markup annotation set
along with the document, as we considered it could be really useful even for
us. Preserving the original content, however, is not efficient because it will
double the size of the object. Tell me if this is OK with you: having the
OriginalMarkup annotation set (not empty)?

However, what we must do is make some API changes to provide a means of
creating documents with original markup, and this will affect more than just
the PopulationTool. Unfortunately, we are in a rush for some tests right now,
and apart from having no time to do this at the moment, I don't really want to
change it before I finish those tests because it will become a sort of mess.
Since you have found a workaround, I hope it's not too inconvenient for you
to wait a week more for this extension.

Excuse me once again for making you wait so long,
Ilian

----- Original Message -----
From: "Jay Johnston" <[EMAIL PROTECTED]>
To: "Ilian Kitchukov" < [EMAIL PROTECTED]>
Cc: "Keith Suderman" <[EMAIL PROTECTED]>; <[email protected]>
Sent: 07 September 2006, Thursday 19:27
Subject: Re: [KIM-discussion] Formatting in Documents


> No big rush.  I'm pulling the original document from the filesystem for
> now, so I have a workaround.
>
> I looked at the source for GATE's DocumentImpl class and if you call
> setPreserveOriginalContent(true), it stores the original content in a
> feature called 'originalContent'.  It could then be retrieved from there
> using document.getFeatures().get("originalContent"), unless you wanted to
> create a convenience function getOriginalContent() to do just that, which
> I think would be useful for others as well.
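
A minimal sketch of what such a convenience function might look like. It assumes the feature key is the 'originalContent' string mentioned above (worth checking against the GATE version in use) and uses a plain java.util.Map as a stand-in for the document's feature map. The class and method names are hypothetical and are not part of the KIM or GATE APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of KIM or GATE.
final class OriginalContentHelper {
    // Assumed feature key, taken from the message above; verify it against your GATE version.
    static final String ORIGINAL_CONTENT_KEY = "originalContent";

    private OriginalContentHelper() {}

    /**
     * Returns the preserved original content, or null when the feature is
     * absent (for example, when original content was not preserved).
     */
    static String getOriginalContent(Map<Object, Object> features) {
        if (features == null) {
            return null;
        }
        Object value = features.get(ORIGINAL_CONTENT_KEY);
        return value == null ? null : value.toString();
    }

    public static void main(String[] args) {
        Map<Object, Object> features = new HashMap<>();
        features.put(ORIGINAL_CONTENT_KEY, "<html><body><p>Hello</p></body></html>");
        // Feature present: the stored markup comes back unchanged.
        System.out.println(getOriginalContent(features));
        // Feature absent: the helper returns null instead of throwing.
        System.out.println(getOriginalContent(new HashMap<>()));
    }
}
```

With a real document, the map passed in would presumably be the one returned by document.getFeatures(). Returning null for a missing feature lets the caller fall back to another source, such as the filesystem copy.
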
>
> Thanks, Jay
>
>
> On 9/7/06, Ilian Kitchukov <[EMAIL PROTECTED]> wrote:
>>
>> Hi Jay, Keith,
>>
>> I really hope it is not too urgent. I will investigate the problem and
>> will have an answer for you, of course, but this will happen after the
>> series of meetings we have this week is finished. I am sorry for the
>> delay but I just can't spend time on this at the moment. However, it is
>> true that an exported document that was populated with the Population
>> Tool has an empty OriginalMarkup annotation set. But the toXML() method
>> exports the internal GATE doc, which means this annotation set is empty
>> in the GATE doc itself. So I suppose Jay's guess about the way a document
>> is created is the important thing.
>>
>> I will clarify this and will be back with an answer asap.
>> Ilian
>>
>> ----- Original Message -----
>> From: "Jay Johnston" <[EMAIL PROTECTED]>
>> To: "Keith Suderman" < [EMAIL PROTECTED]>
>> Cc: <[email protected]>
>> Sent: 07 September 2006, Thursday 01:30
>> Subject: Re: [KIM-discussion] Formatting in Documents
>>
>>
>> > Unfortunately a KIMDocument doesn't expose all the functions of the
>> > underlying GATE Document, so while KIMDocument.toXML() is available
>> > (apparently a thin wrapper around GATE's DocumentImpl.toXml()),
>> > KIMDocument.toXML(Set annotationSet) isn't.  To implement this with
>> > that approach I would have to access the Lucene data store manually,
>> > deserialize the document, and use this method to reconstruct the
>> > original document.  That seems quite inconvenient and very
>> > resource-intensive for just getting the original document.  I'd rather
>> > just read the original document from the filesystem (which is the
>> > solution I've implemented).  If it were convenient and quick for me to
>> > do it through KIM I would, but it's not.
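
The filesystem workaround mentioned above could be sketched roughly as follows. It assumes the application keeps track of where each populated document came from. The class name, method name, and the UTF-8 encoding are illustrative assumptions rather than anything KIM provides.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reads the untouched source file instead of asking KIM for it.
final class OriginalDocumentReader {
    private OriginalDocumentReader() {}

    /** Returns the file's full text, markup included, decoded as UTF-8 (an assumption). */
    static String readOriginal(Path sourcePath) throws IOException {
        return new String(Files.readAllBytes(sourcePath), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small HTML file to a temporary location, then read it back unchanged.
        Path tmp = Files.createTempFile("original", ".html");
        try {
            Files.write(tmp, "<html><b>Hi</b></html>".getBytes(StandardCharsets.UTF_8));
            System.out.println(readOriginal(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

This avoids touching the data store at all, at the cost of keeping the source files around alongside the populated documents.
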
>> >
>> > Further, I don't think the original markup is being retained by KIM.  I
>> > just tried your code and it doesn't return the markup as expected.  In
>> > fact, when you call toXML() on a KIMDocument, the "Original markups"
>> > annotation set is empty:
>> > "<AnnotationSet Name="Original markups" ></AnnotationSet>"
>> > I'm guessing that calling
>> > gate.DocumentImpl.setPreserveOriginalContent(false) (as KIM does) tells
>> > GATE not to store this information (the GATE Javadocs are a bit hazy
>> > on this).
>> >
>> > On 9/6/06, Keith Suderman < [EMAIL PROTECTED]> wrote:
>> >>
>> >> Hi Jay,
>> >>
>> >> What you need to do is call the document's toXml() method and pass in
>> >> the AnnotationSet that contains the annotations you want to
>> >> include.  The original annotations will be in an annotation set named
>> >> "Original markups" so you will need to use something like:
>> >>
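>> >> // Fetch the annotation set that holds the document's original markup.
>> >> gate.AnnotationSet aSet = document.getAnnotations("Original markups");
>> >> if (aSet != null)
>> >> {
>> >>          // Serialize the document with only these annotations included.
>> >>          String xml = document.toXml(aSet);
>> >>          ...
>> >> }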
>> >>
>> >> This won't reproduce the input document exactly as GATE will stick
>> >> gateID attributes on each annotation.
>> >>
>> >> Keith
>> >>
>> >> At 03:26 AM 9/6/2006, borislav popov wrote:
>> >> >Hi Jay,
>> >> >     There should be a way to preserve the original markup since we
>> >> >use the GATE document model underneath. However this is true for
>> >> >some of the methods for creation of KIM Document, and not all. We
>> >> >have to check which method of creation is used in the population tool
>> >> >and determine how the formatting can be preserved.
>> >> >Please be patient if we do not answer today, because it is a national
>> >> >holiday.
>> >> >b
>> >> >
>> >> >Johnston wrote:
>> >> > > When using the Population Tool and the KIM API, I see no way of
>> >> > > returning a version of the stored document with original
>> >> > > markup.  For example, if the source documents are HTML or XML
>> >> > > files, KIMDocument.getContent() returns a plaintext version of the
>> >> > > document stripped of all tags.  The KIMDocument.toXML() method
>> >> > > returns an XML file tagged with annotations and features, but not
>> >> > > the original markup.  Is there some method I'm missing that will do
>> >> > > this or do I need to implement this feature myself?
>> >> > >
>> >> > > Thanks, Jay
>> >> > >
>> >> > >
>> >>
>> ------------------------------------------------------------------------
>> >> > >
>> >> > > _______________________________________________
>> >> > > NOTE: Please REPLY TO ALL to ensure that your reply reaches all
>> >> > > members of this mailing list.
>> >> > >
>> >> > > KIM-discussion mailing list
>> >> > > [email protected]
>> >> > > http://ontotext.com/mailman/listinfo/kim-discussion_ontotext.com
>> >>
>> >> --------------------------------------------------
>> >> Research Associate
>> >> American National Corpus
>> >> [EMAIL PROTECTED]
>> >> http://americannationalcorpus.org
>>
>

