Hello,

https://en.wikipedia.org/wiki/Truncated_regression_model

Sometimes a dataset is missing every sample whose target value
lies above or below some threshold (the data are "truncated").
This is very common for biochemical data, e.g. when the target
value falls outside the detection range of some lab equipment.

I strongly suspect that models designed for truncated data could
handle such datasets better than generic regression methods (i.e.
give better fitted models).
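
To make the idea concrete, here is a minimal sketch of a truncated
regression fit by maximum likelihood. It assumes Gaussian noise and a
known lower threshold, uses synthetic data, and relies only on NumPy
and SciPy. The function name fit_truncated_regression and the
threshold value are hypothetical; nothing like this exists in
scikit-learn as far as I know, and this is not how truncreg is
implemented.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def fit_truncated_regression(X, y, lower):
    """Maximum-likelihood fit of y = b0 + X @ b + N(0, sigma^2), given that
    only samples with y > lower were observed (left truncation).
    Hypothetical helper, for illustration only."""
    X1 = np.column_stack([np.ones(len(y)), X])  # prepend intercept column

    def neg_log_likelihood(params):
        beta, log_sigma = params[:-1], params[-1]
        sigma = np.exp(log_sigma)  # optimise log(sigma) so sigma stays > 0
        mu = X1 @ beta
        # Normal log-density of each observed point ...
        log_pdf = norm.logpdf(y, loc=mu, scale=sigma)
        # ... renormalised by P(y > lower | x), the chance of being observed.
        log_keep = norm.logsf(lower, loc=mu, scale=sigma)
        return -np.sum(log_pdf - log_keep)

    # Start the optimiser from the ordinary least-squares solution.
    beta0, *_ = np.linalg.lstsq(X1, y, rcond=None)
    sigma0 = np.std(y - X1 @ beta0)
    start = np.append(beta0, np.log(sigma0))
    result = minimize(neg_log_likelihood, start, method="BFGS")
    return result.x[:-1], np.exp(result.x[-1])


# Synthetic data: true intercept 1.0, true slope 2.0, noise sd 1.0.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=5000)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

lower = 0.0        # assumed detection limit
keep = y > lower   # samples below the limit are never recorded
x_obs, y_obs = x[keep], y[keep]

# Generic fit (plain least squares) on the truncated sample, for comparison.
ols_beta, *_ = np.linalg.lstsq(
    np.column_stack([np.ones(x_obs.size), x_obs]), y_obs, rcond=None
)
mle_beta, mle_sigma = fit_truncated_regression(x_obs, y_obs, lower)

print("OLS on truncated data (intercept, slope):", ols_beta)
print("Truncated-normal MLE (intercept, slope): ", mle_beta)
print("Truncated-normal MLE sigma:              ", mle_sigma)
```

On data like this, plain least squares tends to flatten the slope,
because the low-target samples that survive truncation are the ones
with unusually large positive noise. The truncated likelihood corrects
for that by dividing each density by the probability of being observed.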

Some points of entry, if that might help:

- R has a truncreg package
  https://cran.r-project.org/web/packages/truncreg/index.html
- a related paper from the Wikipedia page:
  "Local likelihood estimation of truncated regression and
  its partial derivatives: Theory and application"
  https://hal.archives-ouvertes.fr/hal-00520650/file/PEER_stage2_10.1016%252Fj.jeconom.2008.08.007.pdf

I can provide a cleaned public regression dataset for testing, if
someone is interested. (There are many such datasets in ChEMBL and
PubChem, by the way, but you need to know how to "featurize"/encode
molecules.)

Regards,
F.
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
