On 2022-03-07 17:05, Jen Kris wrote:
Thank you MRAB for your reply.

Regarding your first question, pSentence is a list.  In the nltk library, nltk.word_tokenize takes a string, so we convert sentence to string before we call nltk.word_tokenize:

>>> sentence = " ".join(sentence)
>>> pt = nltk.word_tokenize(sentence)
>>> print(sentence)
[ Emma by Jane Austen 1816 ]

But with the C API it looks like this:

PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
PyObject* str_sentence = PyObject_Str(pSentence); // Convert to string

// See what str_sentence looks like:
PyObject* repr_str = PyObject_Repr(str_sentence);
PyObject* str_str = PyUnicode_AsEncodedString(repr_str, "utf-8", "~E~");
const char *bytes_str = PyBytes_AS_STRING(str_str);
printf("REPR_String: %s\n", bytes_str);

REPR_String: "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

So the two string representations are not the same, or at least the output of PyUnicode_AsEncodedString is not the same, as each item is surrounded by single quotes.

Assuming that the conversion to bytes object for the REPR is an accurate representation of str_sentence, it looks like I need to strip the quotes from str_sentence before "PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0)".

So my questions now are (1) is there a C API function that will convert a list to a string exactly the same way as " ".join, and if not then (2) how can I strip characters from a string object in the C API?

Your Python code is joining the list with a space as the separator.

The equivalent using the C API is:

    PyObject* separator;
    PyObject* joined;

    separator = PyUnicode_FromString(" ");
    joined = PyUnicode_Join(separator, pSentence);
    Py_DECREF(separator);
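
The difference between the two conversions can be checked from Python itself. The following is only a small sketch: the token list is copied by hand from the output earlier in this thread rather than loaded from nltk, and the variable names are made up for illustration. PyObject_Str corresponds to str(), and PyUnicode_Join corresponds to str.join().

```python
# Tokens copied by hand from the thread (assumption: this matches
# gutenberg.sents(fileids[0])[0]; nltk is not imported here).
tokens = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

# str() on a list returns its repr, with brackets, quotes and commas.
# This is the Python-level equivalent of PyObject_Str(pSentence).
as_str = str(tokens)

# " ".join() concatenates the items with a single space between them.
# This is the Python-level equivalent of PyUnicode_Join(separator, pSentence).
joined = " ".join(tokens)

print(as_str)
print(joined)
```

If that holds for the real data, passing the joined string (rather than str_sentence) to word_tokenize should give the same input the command-line version receives.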


Mar 6, 2022, 17:42 by pyt...@mrabarnett.plus.com:

    On 2022-03-07 00:32, Jen Kris via Python-list wrote:

        I am using the C API in Python 3.8 with the nltk library, and
        I have a problem with the return from a library call
        implemented with PyObject_CallFunctionObjArgs.

        This is the relevant Python code:

        import nltk
        from nltk.corpus import gutenberg
        fileids = gutenberg.fileids()
        sentences = gutenberg.sents(fileids[0])
        sentence = sentences[0]
        sentence = " ".join(sentence)
        pt = nltk.word_tokenize(sentence)

        I run this at the Python command prompt to show how it works:

        >>> sentence = " ".join(sentence)
        >>> pt = nltk.word_tokenize(sentence)
        >>> print(pt)

        ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

        >>> type(pt)

        <class 'list'>

        This is the relevant part of the C API code:

        PyObject* str_sentence = PyObject_Str(pSentence);
        // nltk.word_tokenize(sentence)
        PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr,
        "word_tokenize");
        PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok,
        str_sentence, 0);

        (where pModule_mstr is the nltk library).

        That should produce a list with a length of 7 that looks like
        it does on the command line version shown above:

        ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

        But instead the C API produces a list with a length of 24, and
        the REPR looks like this:

        '[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\',
        "\'by", "\'", \',\', "\'Jane", "\'", \',\', "\'Austen", "\'",
        \',\', "\'1816", "\'", \',\', "\'", \']\', "\'", \']\']'

        I also tried this with PyObject_CallMethodObjArgs and
        PyObject_Call without success.

        Thanks for any help on this.

    What is pSentence? Is it what you think it is?
    To me it looks like it's either the list:

    ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

    or that list as a string:

    "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

    and that's what you're tokenising.
-- https://mail.python.org/mailman/listinfo/python-list

