Just saw this mail thread. Anyway, below is Python code to extract all the
Nepali words from the example text you gave.
# -*- coding: utf-8 -*-

data = """
<page>
[[en:Apple]]
[[ne:स्याउ]]
[[new:स्याउ]]
[[hi:सेव]]
[[fr:????]]
</page>
"""
def get_next_target(data):
    """Return the next Nepali word and the position where its link ends."""
    marker = '[[ne:'
    start = data.find(marker)
    if start == -1:
        return None, 0
    end = data.find(']]', start)
    if end == -1:
        # Unterminated link: stop rather than slice with a bad index.
        return None, 0
    # Skip past the marker so only the word itself is returned.
    return data[start + len(marker):end], end

def get_all_nepData(data):
    links = []
    while True:
        url, endpos = get_next_target(data)
        if url:
            links.append(url)
            data = data[endpos:]
        else:
            break
    return links

if __name__ == "__main__":
    for word in get_all_nepData(data):
        print(word)
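
If you also want the English word next to the Nepali one (for the
translator idea), a regular expression may be simpler than find(). This is
only a rough sketch written for Python 3. It assumes every page really uses
the [[lang:word]] form from your example, and the names LINK_RE and
extract_pairs are made up for illustration.

```python
# -*- coding: utf-8 -*-
import re

# Matches links such as [[ne:...]]; group 1 is the language code,
# group 2 is the word. The exact link format is an assumption taken
# from the example in this thread.
LINK_RE = re.compile(r"\[\[([a-z-]+):([^\]]+)\]\]")


def extract_pairs(page_text, source="en", target="ne"):
    """Return (source_word, target_word) if both links exist, else None."""
    # findall yields (language, word) tuples, so dict() maps code -> word.
    words = dict(LINK_RE.findall(page_text))
    if source in words and target in words:
        return words[source], words[target]
    return None


sample = """
<page>
[[en:Apple]]
[[ne:स्याउ]]
[[new:स्याउ]]
[[hi:सेव]]
</page>
"""

if __name__ == "__main__":
    print(extract_pairs(sample))
```

Because the language code is captured as a whole, "new" and "ne" end up as
separate keys and do not get mixed up.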

Regarding autocomplete and word suggestion, you might want to look at
Bayes' theorem applied to word frequencies from a large body of text.
Peter Norvig's essay on spelling correction is worth reading thoroughly:
http://norvig.com/spell-correct.html
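
To give a rough idea of how word counts from bulk text can drive
suggestions, here is a minimal sketch for Python 3. It only ranks prefix
matches by frequency and is not the full Bayesian corrector from that page.
The function names and the tiny corpus are invented for illustration; a
real list would come from the Wiktionary dump.

```python
# -*- coding: utf-8 -*-
from collections import Counter


def build_counts(text):
    """Count how often each whitespace-separated word occurs in the text."""
    return Counter(text.split())


def suggest(prefix, counts, limit=5):
    """Return up to `limit` words starting with `prefix`, most frequent first."""
    matches = [word for word in counts if word.startswith(prefix)]
    # Sort by descending count, then alphabetically so ties are stable.
    matches.sort(key=lambda word: (-counts[word], word))
    return matches[:limit]


# Toy corpus (made up); frequent words should be suggested first.
corpus = "स्याउ केरा सुन्तला स्याउ केरा स्याउ"
counts = build_counts(corpus)

if __name__ == "__main__":
    for word in suggest("स", counts):
        print(word)
```

Scanning every word is fine for a small list. For a large vocabulary a
sorted list or a trie would probably be a better fit.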

Pravin

On Apr 11, 10:00 am, Rajesh Pandey <[email protected]> wrote:
> Hi Folks,
> *"If any of you is interested in this, please reply so that we can
> work on it."*
>
> I would like to form a small group of people interested in
> data mining. If you are already involved in nlp-class.org, that would be
> great as well.
> Don't be put off by the term "data mining": the only thing we would do
> is extract Nepali words from the Wiktionary database
> dump <http://dumps.wikimedia.org/backup-index.html> and save them so
> that they can be used for various purposes.
> For instance:
> 1) Autocomplete
> 2) Nepali corpus
> 3) Nepali translator
>
> "Autocomplete" works by offering suggestions as the user starts typing. If
> we have a list of words, we can provide those suggestions.
>
> A Nepali corpus, containing words tagged as "Noun",
> "Adjective", etc., can be created. I wish to use it in an open
> source translator for
> Nepali <http://code.google.com/p/nepaliwikipediatranslator>
> in which I am also involved.
>
> The database dump of Wiktionary has an XML file which contains a lot of
> words and their English equivalents along with equivalents in other
> available languages.
>
> For instance, there would be:
> <page>
> [[en:Apple]]
> [[ne:स्याउ]]
> [[new:स्याउ]]
> [[hi:सेव]]
> [[fr:????]]
> </page>
>
> etc.
> So we need to extract स्याउ and Apple, or a list such as स्याउ, केरा, सुन्तला, into
> a file, so that we can suggest स्याउ when a user starts typing स or
> suggest केरा when a user starts typing क. This is autocomplete.
>
> Once we have pairs such as स्याउ and Apple, we have the basis for a Nepali
> translator as well.
>
> ==================
> Sorry for the ambiguous subject, "Natural language processing". I could have
> used a more specific title; "Data mining" would have been another
> option. Thanks for your patience in reading this email :).
> ======================
> Want to create a web-based PHP/Python/Java application [Nepali translator]
> based on code.google.com/p/nepaliwikipediatranslator? You are welcome.
> (Not .NET, because we already have a lot of stuff in .NET, and we are
> looking for .NET alternatives that we can use easily on Linux.)
> ======================
> --
> Rajesh Pandey

-- 
FOSS Nepal mailing list: [email protected]
http://groups.google.com/group/foss-nepal
To unsubscribe, e-mail: [email protected]

Mailing List Guidelines: 
http://wiki.fossnepal.org/index.php?title=Mailing_List_Guidelines
Community website: http://www.fossnepal.org/
