On May 25, 2023, at 21:13, Tom Hubbard <cubb...@gmail.com> wrote:
> An InChI could not be generated and used to canonise SMILES: null
> 
> Could not generate InChI Numbers: Too many atoms [did you forget 
> 'LargeMolecules' switch?]

CDK uses InChI to generate absolute SMILES. Here's a comment from the code:

     * Create a absolute SMILES generator. Unique SMILES uses the InChI to
     * canonise SMILES and encodes isotope or stereo-chemistry. The InChI
     * module is not a dependency of the SMILES module but should be present
     * on the classpath when generation absolute SMILES.

If you remove either the SmiFlavor.Canonical or the SmiFlavor.Isomeric bit flag 
from your output flavor then you'll get a SMILES, though it won't be an 
absolute SMILES.
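
As a rough illustration of that rule, here is a minimal Python sketch. The bit values and the helper names is_set() and needs_inchi() are stand-ins made up for this example; they are not CDK's actual SmiFlavor constants or API. It only shows that clearing either bit from the combined flavor avoids the InChI-based path:

```python
# Stand-in bit values for illustration only; these are NOT CDK's real
# SmiFlavor constants, which may differ in value and width.
CANONICAL = 0x1
ISOMERIC = 0x2
ABSOLUTE = CANONICAL | ISOMERIC


def is_set(flavour, flag):
    """Hypothetical helper: True when every bit of `flag` is in `flavour`."""
    return (flavour & flag) == flag


def needs_inchi(flavour):
    """Only the Canonical + Isomeric combination goes through InChI."""
    return is_set(flavour, CANONICAL) and is_set(flavour, ISOMERIC)


print(needs_inchi(ABSOLUTE))               # both flags set
print(needs_inchi(ABSOLUTE & ~ISOMERIC))   # canonical only
print(needs_inchi(ABSOLUTE & ~CANONICAL))  # isomeric only
```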


More specifically, CDK uses InChI to generate the atom labels used during 
canonical SMILES generation. In cdk/smiles/SmilesGenerator.java there's a code 
path which looks like:

            // apply the canonical labelling
            if (SmiFlavor.isSet(flavour, SmiFlavor.Canonical)) {

                // determine the output order
                int[] labels = labels(flavour, molecule);

where labels() is:

    private static int[] labels(int flavour, final IAtomContainer molecule) throws CDKException {
        // FIXME: use SmiOpt.InChiLabelling
        long[] labels = SmiFlavor.isSet(flavour, SmiFlavor.Isomeric) ? inchiNumbers(molecule)
                : Canon.label(molecule,
                              GraphUtil.toAdjList(molecule),
                              createComparator(molecule, flavour));


Thus, if both SmiFlavor.Canonical and SmiFlavor.Isomeric are set, it ends up 
using code in cdk/graph/invariant/InChINumbersTools.java, which configures 
InChI to do the atom order assignments via the 'auxiliary information':

    public static long[] getNumbers(IAtomContainer atomContainer) throws CDKException {
        String aux = auxInfo(atomContainer, new InchiFlag[0]);
      ...

    static String auxInfo(IAtomContainer container, InchiFlag... flags) throws CDKException {
        InChIGeneratorFactory factory = InChIGeneratorFactory.getInstance();
        boolean org = factory.getIgnoreAromaticBonds();
        factory.setIgnoreAromaticBonds(true);
        InChIGenerator gen = factory.getInChIGenerator(container, flags);
        factory.setIgnoreAromaticBonds(org); // an option on the singleton so we should reset for others
        if (gen.getStatus() == InchiStatus.ERROR)
            throw new CDKException("Could not generate InChI Numbers: " + gen.getMessage());
        return gen.getAuxInfo();

That calls into the InChI library, which has this check (actually, it's in a 
few places, all with the same idea):


    max_num_at = ip->bLargeMolecules ? MAX_ATOMS : NORMALLY_ALLOWED_INP_MAX_ATOMS;
    if (nNumAtoms >= max_num_at)
    {
        TREAT_ERR( *err, 0, "Too many atoms [did you forget 'LargeMolecules' switch?]" );
        *err = 70;
        orig_inp_data->num_inp_atoms = -1;
        goto err_exit;
    }

where

#define MAX_ATOMS  32766
#define NORMALLY_ALLOWED_INP_MAX_ATOMS 1024
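
To make the consequence concrete, here is a small Python sketch of that same comparison. The function name too_many_atoms() is made up for this example; the two constants are the ones from the InChI source quoted above. Note the comparison is ">=", so an input with exactly 1024 atoms is already rejected in the default mode:

```python
MAX_ATOMS = 32766                      # value quoted from the InChI source
NORMALLY_ALLOWED_INP_MAX_ATOMS = 1024  # value quoted from the InChI source


def too_many_atoms(num_atoms, large_molecules=False):
    """Hypothetical mirror of the quoted check (uses >=, not >)."""
    limit = MAX_ATOMS if large_molecules else NORMALLY_ALLOWED_INP_MAX_ATOMS
    return num_atoms >= limit


# The 1079-atom structure used in the reproduction below:
print(too_many_atoms(1079))                        # default mode
print(too_many_atoms(1079, large_molecules=True))  # with LargeMolecules
```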


The large-molecule mode is enabled with the InChI flag 'LargeMolecules', which 
jna-inchi documents at
https://github.com/dan2097/jna-inchi/blob/master/jna-inchi-api/src/main/java/io/github/dan2097/jnainchi/InchiFlag.java#L47
as:

/** Allows input of molecules up to 32767 atoms [Produces 'InChI=1B' indicating beta status of resulting identifiers]*/

So it appears that changing cdk/graph/invariant/InChINumbersTools.java line 49 
from:

        String aux = auxInfo(atomContainer, new InchiFlag[0]);

to include LargeMolecules in that 'new InchiFlag' array would make this work.

However, I'm not a Java developer and don't know how to make this change nor 
test it. I can say it does not seem to be user-configurable.


I am a Python developer, and I can reproduce the error using my 'chemfp 
translate' tool, which uses a Java/Python bridge to work with the CDK. The 
following uses RDKit to translate a FASTA sequence to an SDF with 1079 atoms:


% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | \
    chemfp translate -T rdkit --in fasta --out sdf | head -6
megatryp
     RDKit

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 1079 1232 0 0 0
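
As a sanity check on that COUNTS line, here is a back-of-the-envelope Python sketch. It assumes the SDF lists heavy atoms only (hydrogens implicit), that tryptophan (C11H12N2O2) contributes 15 heavy atoms and 16 bonds between them, and that each peptide bond drops one oxygen while swapping a C-O bond for a C-N bond. The function name poly_trp_counts() is made up for this example:

```python
TRP_HEAVY_ATOMS = 15  # C11 N2 O2, hydrogens not counted
TRP_HEAVY_BONDS = 16  # 15 atoms in a tree need 14 bonds, plus 2 ring closures


def poly_trp_counts(n):
    """Heavy-atom and bond counts for a linear chain of n tryptophans."""
    # Each of the n-1 peptide bonds loses one oxygen (as water).
    atoms = n * TRP_HEAVY_ATOMS - (n - 1)
    # The lost C-O bond is replaced by the new C-N bond: no net change.
    bonds = n * TRP_HEAVY_BONDS
    return atoms, bonds


fasta = ">megatryp\n" + "W" * 77 + "\n"  # same record as the one-liner
n_residues = len(fasta.splitlines()[1])
print(poly_trp_counts(n_residues))       # compare with "COUNTS 1079 1232"
```

Under those assumptions the counts agree with the RDKit output, and 1079 is above the default 1024-atom InChI limit.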

I can have it go from FASTA to SDF using RDKit then have CDK read the SDF to 
produce the SMILES generation failure:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | \
    chemfp translate -T rdkit --in fasta --via sdf -U cdk --out smi
Error: CDK cannot create the SMILES string (input title='megatryp'): An InChI 
could not be generated and used to canonise SMILES: null, file '<stdin>', line 
1, record #1: first line is '>megatryp'. Skipping.

(The --via option defaults to 'sdf', so I'll omit it in the rest.)

I can configure the CDK SMILES writer to use the Default flavor, but without 
the 'Canonical' option, to show that this workaround gives a (non-canonical) 
SMILES:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | \
    chemfp translate -T rdkit --in fasta -U cdk --out smi \
    -W flavor=Default,-Canonical | fold | head -2
NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)
NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)

Here I'll disable Isomeric instead, so it should be canonical but not isomeric, 
which might be okay for you: 

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | \
    chemfp translate -T rdkit --in fasta -U cdk --out smi \
    -W flavor=Default,-Isomeric | fold | head -2
O=C(O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(
NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(

That's the flavor you pass into SmilesGenerator().

Cheers,


                                Andrew
                                da...@dalkescientific.com
