[Corpora-List] Re: Any literature about tensors-based corpora NLP research with actual examples (and homework ;-)) you would suggest? ...

Anil Singh via Corpora Sat, 05 Aug 2023 09:26:49 -0700

On Sat, Aug 5, 2023 at 6:56 PM Ada Wan <[email protected]> wrote:


> Hi Anil
>
> Thanks for your comments. (And thanks for reading my work.)
>
> Yeah, there is a lot that one has to pay attention to when it comes to
> what "textual computing" entails (and to which extent it "exists"). Beyond
> "grammar" definitely. But experienced CL folks should know that. (Is this
> you btw: https://scholar.google.com/citations?user=QKnpUbgAAAAJ?
>

Yes, that's me, for better or for worse.


> If not, do you have a webpage for your work? Nice to e-meet you either
> way!)
>
>
Thank you.

> Re "I know first hand the problems in doing NLP for low resource languages
> which are related to text encodings":
> which specific languages/varieties are you referring to here? If the issue
> lies in the script not having been encoded, one can contact SEI about it (
> https://linguistics.berkeley.edu/sei/)? I'm always interested in knowing
> what hasn't been encoded. Are the scripts on this list (
> https://linguistics.berkeley.edu/sei/scripts-not-encoded.html)?
>
>
Well, that's a long story. It is related to the history of the adoption of
computers by the public at large in India. The really difficult part is not
about scripts being encoded. Instead, it is about a script being
over-encoded or encoded in a non-standard way, and about the lack of
adoption of standard encodings and input methods. Just to give one example,
even though a single encoding (called ISCII) for all Brahmi-origin scripts
of India was created officially, most people were unaware of it or didn't
use it, for many reasons. One major reason was that it was not supported on
operating systems, including Windows (which was anyway developed many years
after the creation of ISCII). Input methods and rendering engines for it
were not available. You had to have a special terminal to use it, but that
was a text-only terminal, used mainly in research centers and perhaps for
some very limited official purposes. And computers, as far as the general
public was concerned, were most commonly used for desktop publishing (which
became part of Indian languages as "DTP"). The non-standard encodings were
mainly font encodings, meant just to enable proper rendering of text for
publishing. One of the most popular 'encodings' was based on the Remington
typewriter for Hindi. Another was based on a mostly phonetic mapping from
Roman to Devanagari. Other languages which did not use Devanagari also had
their own non-standard encodings, often multiple encodings. The reason
these became popular was that they enabled people to type in Indian
languages and see the text rendered properly. No other option was
available, and people understandably didn't really care whether an encoding
was standard or not. It wasn't until recently that Indic scripts were
properly supported by any OS. It is possible that even now, when Unicode is
supported on most OSes and input methods are available as part of the OS,
there are people still using non-standard encodings. Even now, you can come
across problems related to either input methods or rendering for Indic
scripts. And most importantly, there is still no universally accepted way
to actually use these input methods. Most Indians, for example, on social
media and even in emails, type in some pseudo-phonetic way using Roman
letters with the QWERTY keyboard. Typing in Indian languages using Indic
scripts is still a specialized skill.

The result of all this is that when you try to collect data for low
resource languages, including major languages of India, there may be a lot
of data -- or perhaps even all the data, depending on the language -- which
is in some non-standard ad-hoc encoding that has a non-trivial mapping to
Unicode. This is difficult partly because non-standard encodings are often
based on glyphs, rather than on actual units of the script. So, to be able
to use the data you need a perfect encoding converter to get the text into
Unicode (UTF-8). Such converters have existed for a long time, but since
they were difficult to create, they were/are proprietary and not available
even to researchers in most cases. It seems a pretty good OCR system has
been developed for Indic scripts/languages, but I have not yet had the
chance to try it.
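
To illustrate why a glyph-based mapping is non-trivial, here is a minimal
Python sketch. The mapping table is hypothetical (made up for illustration,
not taken from any real font encoding), and the function name is an
assumption. The point it shows: a glyph-based encoding stores the Devanagari
vowel sign i in visual order (before the consonant), while Unicode stores it
in logical order (after the consonant), so a converter has to reorder as well
as map.

```python
# Hypothetical legacy-code-to-Unicode table (illustrative only, not a real font).
GLYPH_TO_UNICODE = {
    "d": "\u0915",  # DEVANAGARI LETTER KA
    "e": "\u092E",  # DEVANAGARI LETTER MA
    "k": "\u093E",  # DEVANAGARI VOWEL SIGN AA (written after the consonant)
    "f": "\u093F",  # DEVANAGARI VOWEL SIGN I (drawn before the consonant)
}

# Signs that the legacy text stores before the consonant they attach to.
PRE_BASE_SIGNS = {"\u093F"}


def convert_legacy_to_unicode(legacy_text):
    """Map legacy codes to Unicode, then move pre-base signs after the next character."""
    # Step 1: one-to-one mapping; unknown characters pass through unchanged.
    mapped = [GLYPH_TO_UNICODE.get(ch, ch) for ch in legacy_text]

    # Step 2: reorder from visual (glyph) order to logical (Unicode) order.
    result = []
    i = 0
    while i < len(mapped):
        if mapped[i] in PRE_BASE_SIGNS and i + 1 < len(mapped):
            result.append(mapped[i + 1])  # consonant first
            result.append(mapped[i])      # then the vowel sign
            i += 2
        else:
            result.append(mapped[i])
            i += 1
    return "".join(result)


converted = convert_legacy_to_unicode("fd")
print([hex(ord(c)) for c in converted])
```

This sketch only handles a sign followed by a single consonant. With
consonant clusters (conjuncts, half forms, split glyphs) the sign has to move
past the whole cluster, which is part of what makes a complete converter hard
to build.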

For example, I have been working on Bhojpuri, Magahi and Maithili for the
last few years. When we tried to collect data for these languages, we ran
into the same problem. It is not really a problem for the general public,
because their purpose is served by these non-standard encodings, but for
NLP/CL you face difficulty in getting the data in a usable form.

This is just a brief overview and I also don't really know the full extent
of it, in the sense that I don't have a comprehensive list of such
non-standard encodings for all Indic scripts.


> Re the unpublished paper (on a computational typology of writing
> systems?):
> when and to where (as in, which venues/publications) did you submit it?
> I remember one of my first term papers from the 90s being on the
> phonological system of written Cantonese (or sth like that --- don't
> remember my wild days), the prof told me it wasn't "exactly linguistics"...
>
I submitted it to the journal Written Language and Literacy in 2009. It
was actually mostly my mistake that I didn't submit a revised version of
the paper, as I was going through a difficult period then.


> Re "on building an encoding converter that will work for all 'encodings'
> used for Indian languages":
> this sounds interesting!
>
>
Yes, I still sometimes wish I could build it.


> Re "I too wish there was a good comprehensive history text encodings,
> including non-standard ad-hoc encodings":
> what do you mean by that --- history of text encodings or historical text
> encodings?
> After my discoveries from recent years, when my "mental model" towards
> what's been practiced in the language space (esp. in CL/NLP) finally
> *completely* shifted, I had wanted to host (or co-host) a tutorial on character
> encoding for those who might be under-informed on the matter (including but
> not limited to the "grammaroholics" (esp. the CL/NLP practitioners who seem
> to be stuck doing grammar, even in the context of computing) --- there are
> so many of them! :) )
>
>
I mostly meant the non-standard 'encodings' (really just ad-hoc mappings)
created to serve someone's immediate purpose. To fully understand the
situation, you have to be familiar with the social, political, economic and
other aspects of the language situation in India.


> Re "word level language identification":
> I don't do "words" anymore. In that 2016 TBLID paper of mine, I
> (regrettably) was still going with the flow in under-reporting on
> tokenization procedures (like what many "cool" ML papers did). But "words"
> do certainly shape the results! I'm really looking forward to everyone working with
> full-vocabulary, pure character or byte formats (depending on the task),
> while being 100% aware of statistics. Things can be much more transparent
> and easily replicable/reproducible that way anyway.
>
>
Well, I used the word 'word' as just a shorthand for space-separated
segments. In my PhD thesis, I had also argued against the word being the
unit of computational processing, or whatever you call it. I had called the
unit the Extra-Lexical Unit, consisting of a core morpheme and inflectional
parts. I realize now that even that may not necessarily work for languages
with highly fusional morphology. But something like this is now the
preferred unit of morphological processing, as in the CoNLL shared tasks
and UniMorph. I also realize that I could not have been the first to come
to this conclusion.


> Re "We have to be tolerant of what you call bad research for various
> unavoidable reasons. Research is not what it used to be":
> No, I think one should just call out bad research and stop doing it. I
> wouldn't want students to burn their midnight oil working hard for nothing.
> Bad research warps also expectations and standards, in other sectors as
> well (education, healthcare, commerce... etc.). Science, as in the pursuit
> of truth and clarity, is and should be the number 1 priority of any decent
> research. (In my opinion, market research or research for marketing
> purposes should be all consolidated into one track/venue if they lack
> scientific quality.) I agree research is not what it used to be --- but in
> the sense that the quality is much worse in general, much hacking around
> with minor, incremental improvements. Like in the case of "textual
> computing", people are "grammar"-hacking.
>
>
I completely agree with you, but ... Sometimes it is wise to be silent.


> Re "better ... gender representation":
> hhmm... I'm not so sure about that.
>
>
You are a better judge of that. I just shared my opinion, which may not be
completely free from bias, although I do try.


> Re "About grammar, I have come to think of it as a kind of language model
> for describing some linguistic phenomenon":
> nah, grammar not necessary.
>
>
I don't say it is necessary, but I see it as one possible model to describe
language, which can be useful for some -- like educational -- purposes if
used in the right way.


> Re grammaroholic reviewers:
> yeah, there are tons of those in the CL/NLP space. I think many of them
> are only willing and/or able to critique on grammar. Explicit is that it
> shows that they don't want to check one's math and code --- besides, when
> most work on "words" anyway, there is a limit to how things are
> replicable/reproducible, esp. if on a different dataset. The implicit bit,
> however, is that I think there is some latent intent to introduce/reinforce
> the influence of "grammar" into the computing space. That, I do not agree
> with at all.
>
>
I should confess that I am sometimes guilty of that (pointing out
grammatical mistakes) myself. However, the situation is complicated in
countries like India due to historical and other reasons. I think papers
should at least be in a condition where they can be understood roughly as
intended. This may not always be the case, particularly with non-native
speakers of English, or people who are not yet speakers/writers of English
at all. Now, perhaps no one knows better than I do that it is not entirely
their fault, but as a reviewer, sometimes your patience is severely tested.


> Re "magic":
> yes, once one gets over the hype, it's just work.
>
>
True, but what I said is based on where I am coming from (as in, to this
position), which will take a really long time to explain. Of course, I
don't literally mean magic.

> Re "I have no experience of field work at all and that I regret, but it is
> partly because I am not a social creature":
> one can be doing implicit and unofficial "fieldwork" everyday if one pays
> attention to how language is used.
>
>
That indeed I do all the time. I meant official fieldwork.


> Best
> Ada
>
> On Sat, Aug 5, 2023 at 8:51 AM Anil Singh <[email protected]> wrote:
>
>> I forgot the main reason for writing the last email. Most importantly, I
>> share your view that orthography is underrepresented in NLP/CL. I had once
>> tried to build a computational typology of writing systems. The paper was
>> not published, but I still believe that is something worth doing. Perhaps
>> one day I will complete that work.
>>
>> Also, I am conscious that, technically, I used the term category mistake
>> in a wrong way, but I hope I was understood correctly.
>>
>> On Sat, Aug 5, 2023 at 12:47 AM Hesham Haroon <[email protected]>
>> wrote:
>>
>>> Hi Ada and Anil,
>>>
>>> I'm enjoying reading your discussion. It's been very informative and
>>> thought-provoking. Thanks for sharing your insights!
>>>
>>> Best,
>>> Hesham
>>>
>>>
>>> On Fri, Aug 4, 2023, 8:51 PM Anil Singh via Corpora <
>>> [email protected]> wrote:
>>>
>>>> I have been enjoying the discussion. I hope it will continue. I have
>>>> learnt some new things. I was also confused about the tensor thing,
>>>> although not in the same way.
>>>>
>>>> I hope I am not among one of the scare quoted NLP practitioners,
>>>> because that's exactly what I like to call myself. I certainly don't think
>>>> I am qualified to work on language just because I can speak one.
>>>>
>>>> I am currently reading your thesis and trying to digest it.
>>>>
>>>> I also glanced through the syllabus you are preparing. I share your
>>>> interest in text encodings, among other things. I can't resist talking
>>>> about text encodings, whether I am teaching NLP or Computer Programming,
>>>> because I know first hand the problems in doing NLP for low resource
>>>> languages which are related to text encodings.
>>>>
>>>> If you can actually teach that syllabus, I envy you as I am unable to
>>>> get people interested in the very basics of language/linguistics.
>>>>
>>>> About the importance of granularities, I had, in my (very badly
>>>> written) PhD thesis, explicitly talked about NLP problem formulation in
>>>> terms of granularities. In my second research paper, I had used byte
>>>> n-grams for language identification. I use byte n-grams whenever I can.
>>>> Actually, I used it for language-encoding pair identification, as there are
>>>> so many non-standard 'encodings' which were used and perhaps are still used
>>>> for South Asian languages. My very first -- unsuccessful or you may say
>>>> unfinished -- attempt at doing some kind of NLP even before knowing that a
>>>> field called NLP or CL existed, was on building an encoding converter that
>>>> will work for all 'encodings' used for Indian languages. I too wish there
>>>> was a good comprehensive history text encodings, including non-standard
>>>> ad-hoc encodings.
>>>>
>>>> I also share your interest in word level language identification. In
>>>> 2007 I had published one of the earliest papers on what I called language
>>>> identification in a multilingual document, where I had tried word level
>>>> language identification, and what is now called language identification for
>>>> code switched data.
>>>>
>>>> About gender, I had actually made a kind of category assumption. I
>>>> didn't pay attention to the name, which you share with no less than Ada
>>>> Byron.
>>>>
>>>> We have to be tolerant of what you call bad research for various
>>>> unavoidable reasons. Research is not what it used to be. At least that's my
>>>> opinion. Still, in some ways it is better, perhaps like in the case of
>>>> gender representation.
>>>>
>>>> About grammar, I have come to think of it as a kind of language model
>>>> for describing some linguistic phenomenon. I once received a review in
>>>> which the reviewer mentioned some grammatical mistakes and wrote that you
>>>> don't have to just see how the sentence/phrase sounds, you have to
>>>> explicitly check the grammar according to the rules. Thank you very much,
>>>> but I learnt English without paying any explicit attention to grammar. I am
>>>> pretty sure I didn't learn much from explicit teaching of grammar, whether
>>>> of English, or of Sanskrit, or of French. That doesn't necessarily mean I
>>>> don't believe in grammar, but I guess I am moving towards the language
>>>> games view of language.
>>>>
>>>> As to language being magical, well, that depends on what you mean by
>>>> magical. To me, it seems it is magical in the same sense as life itself is
>>>> magical. Nothing more, nothing less. Even computer programming I have been
>>>> known to call magical in a certain sense.
>>>>
>>>> I also completely agree that we can only hope that we are communicating
>>>> as we intended, but we rarely, if ever, actually attain that goal.
>>>>
>>>> I can't match your background, but I did have -- what can be called --
>>>> four rounds of graduate training in different disciplines. I am still
>>>> trying to learn new things about language. However, I have no experience of
>>>> field work at all and that I regret, but it is partly because I am not a
>>>> social creature, or, to be more precise (as if one can be precise with
>>>> language), I am socially totally incompetent. I wouldn't know how to
>>>> approach anyone for fieldwork in Linguistics.
>>>>
>>>> On Fri, Aug 4, 2023 at 9:03 PM Ada Wan via Corpora <
>>>> [email protected]> wrote:
>>>>
>>>>> @Toms:
>>>>> for completeness' sake: would you mind please sharing your background?
>>>>> Thanks.
>>>>>
>>>>> On Fri, Aug 4, 2023 at 5:31 PM Ada Wan <[email protected]> wrote:
>>>>>
>>>>>> Thanks x2, Ibrtchx.
>>>>>>
>>>>>> On Fri, Aug 4, 2023 at 3:30 AM Albretch Mueller <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On 8/3/23, Toms Bergmanis <[email protected]> wrote:
>>>>>>>  ...
>>>>>>>
>>>>>>>  I, for one, have benefited from Ada's, as well as other member's
>>>>>>> suggestions and comments as I hope they have somehow benefited from
>>>>>>> mine.
>>>>>>>  lbrtchx
>>>>>>>
>>>>>> _______________________________________________
>>>>> Corpora mailing list -- [email protected]
>>>>> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
>>>>> To unsubscribe send an email to [email protected]
>>>>>
>>>>
>>>>
>>>> --
>>>> - Anil
>>>> _______________________________________________
>>>> Corpora mailing list -- [email protected]
>>>> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
>>>> To unsubscribe send an email to [email protected]
>>>>
>>>
>>
>> --
>> - Anil
>>
>

-- 
- Anil

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
