Hi Alexis,

if you don't need to track which document each SMILES came from, you can simply combine the words from all documents into one list, keep only the unique ones, and test those. Then you don't need to store and check valid/invalid strings at all, which reduces the complexity of the problem as well.
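A minimal sketch of that pooling step (the two document strings below are placeholders for the real texts):

```python
# Combine the words from all documents and keep only the unique ones,
# so each candidate string is handed to the parser at most once.
documents = [
    "The product CCO was isolated in good yield",
    "We dissolved CCO in water together with c1ccccc1",
]  # placeholders for the real document contents

unique_words = set()
for doc in documents:
    unique_words.update(doc.split())

# unique_words now holds each distinct word once; "CCO" appears a
# single time even though it occurs in both documents.
```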

Pavel.

On 12/02/2016 11:11 AM, Greg Landrum wrote:
An initial start on some regexps that match SMILES is here: https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb

That may also be useful.

On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty <alexis.parenty.h...@gmail.com> wrote:

    Hi Markus,


    Yes! I might discover novel compounds that way!! It would be
    interesting to see what they look like…


    Good suggestion to also store the words that were correctly
    identified as SMILES. I’ll add that to the script.


    I also like your “distribution of words” idea. I could safely skip
    any word that occurs more than 1% of the time, and could play
    around with the threshold to find an optimum.


    I will try all the suggestions and time them to see which is best.
    I’ll keep everyone in the loop and share the script and results.
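A rough sketch of the frequency cutoff (the corpus is a toy, and the threshold is set to 20% only so this tiny example filters something; on a real corpus it would be something like the 1% mentioned above):

```python
from collections import Counter

# Toy corpus: one string per document (placeholders for the real texts).
documents = [
    "CCO was added and CCO reacted",
    "the the the sample gave c1ccccc1",
]

counts = Counter(word for doc in documents for word in doc.split())
total = sum(counts.values())

# Skip words that make up more than the threshold fraction of the corpus;
# only the remaining (rarer) words get sent to Chem.MolFromSmiles().
threshold = 0.20  # ~1% on a real corpus; 20% here so the toy corpus filters
candidates = [w for w, n in counts.items() if n / total <= threshold]
# "the" (3 of 12 words, 25%) is dropped; "CCO" and "c1ccccc1" survive.
```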


    Thanks,


    Alexis


    On 2 December 2016 at 10:47, Markus Sitzmann
    <markus.sitzm...@gmail.com> wrote:

        Hi Alexis,

        you may also find some "novel" compounds with this approach :-).

        Whether your tuple solution improves performance depends
        strongly on the content of your text documents and how often
        they repeat the same words - but my guess is that it will
        help. It is probably even better to look at the distribution
        of words before you feed them to RDKit. You should also
        "memorize" the words that successfully generated a structure;
        it doesn't make sense to parse them again.

        Markus

        On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski
        <mac...@wojcikowski.pl> wrote:

            Hi Alexis,

            You may want to use a regex to filter out strings that
            contain invalid characters (e.g. only a small subset of
            atoms may be written without brackets). See the "Atoms"
            section:
            http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html


            The set might grow pretty quickly and become inefficient,
            so I'd parse all the strings that pass the above filter.
            There will still be some false positives, like "CC", which
            can occur in ordinary text (especially emails).
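A sketch of such a prefilter - the character class is a simplification of the SMILES alphabet, not a validator, so survivors still need Chem.MolFromSmiles() (and, as noted, words like "CC" will slip through):

```python
import re

# Rough SMILES alphabet: atoms, brackets, bonds, branches, ring-closure
# digits, charges, stereo marks. Anything outside this set cannot be SMILES.
SMILES_CHARS = re.compile(r"^[A-Za-z0-9@+\-\[\]()\\/%=#$.~*:]+$")

def could_be_smiles(word):
    """Cheap prefilter: True means 'worth trying Chem.MolFromSmiles'."""
    return bool(SMILES_CHARS.match(word))

print(could_be_smiles("c1ccccc1"))    # True  - real SMILES
print(could_be_smiles("hello,world")) # False - comma is not in the alphabet
print(could_be_smiles("CC"))          # True  - false positive, as noted
```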

            ----
            Pozdrawiam,  |  Best regards,
            Maciek Wójcikowski
            mac...@wojcikowski.pl

            2016-12-02 10:11 GMT+01:00 Alexis Parenty
            <alexis.parenty.h...@gmail.com>:

                Dear all,


                I am looking for a way to extract SMILES strings
                scattered across many text documents (thousands of
                documents, several pages each).

                At the moment, I am thinking of scanning each word in
                the text, trying to make a mol object from it with
                Chem.MolFromSmiles(), and storing the words that
                return a mol object that is not None.

                Can anyone think of a better/quicker way?


                Would it be worth storing every word that does not
                return a mol object from Chem.MolFromSmiles() in a
                set, and excluding those words from the subsequent
                search?


                Something along these lines:


                from rdkit import Chem

                excluded_set = set()
                smiles_list = []

                for each_word in text:
                    if each_word not in excluded_set:
                        each_word_mol = Chem.MolFromSmiles(each_word)
                        if each_word_mol is not None:
                            smiles_list.append(each_word)
                        else:
                            # store the word itself, not the (None) mol
                            excluded_set.add(each_word)


                Wouldn't searching that growing set actually take more
                time than blindly trying to make a mol object from
                every word?

                Any suggestion?


                Many thanks and regards,


                Alexis


                
------------------------------------------------------------------------------
                Check out the vibrant tech community on one of the
                world's most
                engaging tech sites, SlashDot.org!
                http://sdm.link/slashdot
                _______________________________________________
                Rdkit-discuss mailing list
                Rdkit-discuss@lists.sourceforge.net
                https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



            



        



    





