And, more to the point, the discussion on Reddit:
https://www.reddit.com/r/MachineLearning/comments/6m8tp0/p_deep_learning_for_estimating_race_and_ethnicity/
Bill
On 7/9/17 5:13 PM, Bill Ross wrote:
Possibly of interest:
Race and ethnicity Imputation from Disease history with Deep LEarning
https://github.com/jisungk/riddle
Bill
On 7/6/17 6:00 PM, Bill Ross wrote:
Unless the data concretely promotes discrimination, it seems
discriminatory to exclude it.
Bill
On 7/6/17 5:39 PM, Sebastian Raschka wrote:
I think there can be some middle ground: adding a new, simple
dataset to demonstrate regression (maybe auto-mpg, wine quality, or
something like that), using it for the scikit-learn examples in the
main documentation etc., but leaving the Boston dataset in the code base
for now. Whether it's a weak argument or not, it would be quite
destructive to remove the dataset altogether in the next version or
so, not only because old tutorials use it but because many unit tests in
many different projects depend on it. I think it might be better to
phase it out by having a good alternative first, and I am sure that
the scikit-learn maintainers wouldn't have anything against it if
someone updated the examples/tutorials to use different datasets.
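A minimal sketch of what an updated docs example could look like, using the built-in diabetes loader as a stand-in (whatever replacement dataset is eventually chosen, this is just illustrative):

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Any small regression dataset works here; diabetes already ships with scikit-learn.
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    reg = LinearRegression().fit(X_train, y_train)
    print("held-out R^2:", reg.score(X_test, y_test))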
Best,
Sebastian
On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias
<jni.s...@gmail.com> wrote:
For what it's worth: I'm sympathetic to the argument that you can't
fix the problem if you don't measure it, but I agree with Tony that
"many tutorials use it" is an extremely weak argument. We removed
Lena from scikit-image because it was the right thing to do. I very
much doubt that the Boston house-prices dataset is in more widespread
use than Lena was in image processing.
You can argue about whether or not it's morally right or wrong to
include the dataset. I see merit to both arguments. But "too many
tutorials use it" is very similar in flavour to "the economy of the
South would collapse without slavery."
Regarding fair uses of the feature, I would hope that all sklearn
tutorials using the dataset mention such uses. The potential for
abuse and misinterpretation is enormous.
On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber
<jmschreibe...@gmail.com>, wrote:
Hi Tony
As others have pointed out, I think that you may be
misunderstanding the purpose of that "feature." We are in
agreement that discrimination against protected classes is not OK,
and that even outside complying with the law one should avoid
discrimination, in model building or elsewhere. However, I
disagree that one does this by eliminating from all datasets any
feature that may allude to these protected classes. As Andreas
pointed out, there is a growing effort to ensure that machine
learning models are fair and benefit the common good (such as
FATML, DSSG, etc.), and from my understanding the general
consensus isn't necessarily that simply eliminating the feature is
sufficient. I think we are in agreement that naively learning a
model over a feature set containing questionable features and
calling it a day is not okay, but as others have pointed out,
having these features present and handling them appropriately can
help guard against the model implicitly learning unfair biases (even
if they are not explicitly exposed to the feature).
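A minimal sketch of that "keep the column for auditing, not for training" idea, using the Boston data itself (purely illustrative; real fairness auditing needs far more care than a single correlation):

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.datasets import load_boston
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_predict

    boston = load_boston()
    names = list(boston.feature_names)
    b_col = names.index('B')

    X = np.delete(boston.data, b_col, axis=1)   # the model never sees the column
    sensitive = boston.data[:, b_col]           # but we keep it around for the audit
    y = boston.target

    # If out-of-sample errors still track the held-out column, the model may be
    # picking the information up implicitly through correlated features.
    pred = cross_val_predict(LinearRegression(), X, y, cv=5)
    r, p = pearsonr(y - pred, sensitive)
    print("residuals vs. held-out column: r=%.3f, p=%.3g" % (r, p))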
I would welcome the addition of the Ames dataset to the ones
supported by sklearn, but I'm not convinced that the Boston
dataset should be removed. As Andreas pointed out, there is a
benefit to having canonical examples present so that beginners can
easily follow along with the many tutorials that have been written
using them. As Sean points out, the paper itself is trying to pull
out the connection between house price and clean air in the
presence of possible confounding variables. In a more general
sense, saying that a feature shouldn't be there because a simple
linear regression is unaffected by the results is a bit odd
because it is very common for datasets to include irrelevant
features, and handling them appropriately is important. In
addition, one could argue that having this type of issue arise in
a toy dataset has a benefit because it exposes these types of
issues to those learning data science earlier on and allows them
to keep these issues in mind in the future when the data is more
serious.
It is important for us all to keep issues of fairness in mind when
it comes to data science. I'm glad that you're speaking out in
favor of fairness and trying to bring attention to it.
Jacob
On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante
<sean.viola...@gmail.com> wrote:
G Reina,
you make a bizarre argument. You argue that one should not even
check whether racism is a possible factor in house prices?
But then you yourself check whether it's relevant.
Then you say
"but I'd argue that it's more due to the location (near water,
near businesses, near restaurants, near parks and recreation) than
to the ethnic makeup"
Which was basically what the original authors wanted to show too
(Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand
for clean air', J. Environ. Economics & Management, vol. 5, 81-102,
1978). But unless you measure ethnic make-up, you cannot show that it
is not a confounder.
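A rough sketch of that check, comparing the pollution coefficient with and without the column present (illustrative only, not a proper confounding analysis):

    import numpy as np
    from sklearn.datasets import load_boston
    from sklearn.linear_model import LinearRegression

    boston = load_boston()
    names = list(boston.feature_names)
    nox = names.index('NOX')    # air-pollution feature
    b_col = names.index('B')

    full = LinearRegression().fit(boston.data, boston.target)
    reduced = LinearRegression().fit(np.delete(boston.data, b_col, axis=1),
                                     boston.target)

    # Column indices shift once B is removed, so recompute NOX's position.
    nox_reduced = [n for n in names if n != 'B'].index('NOX')
    print("NOX coefficient with B column:   ", full.coef_[nox])
    print("NOX coefficient without B column:", reduced.coef_[nox_reduced])

If the coefficient of interest moves substantially between the two fits, the omitted column was doing confounding work, which is exactly why you need to have measured it in the first place.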
The term "white flight" refers to affluent white families moving
to the suburbs.. And clearly a question is whether/how much was
racism or avoiding air pollution.
On 6 Jul 2017 6:10 pm, "G Reina" <gre...@eng.ucsd.edu> wrote:
I'd like to request that the "Boston Housing Prices" dataset in
sklearn (sklearn.datasets.load_boston) be replaced with the "Ames
Housing Prices" dataset
(https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am
willing to submit the code change if the developers agree.
The Boston dataset has the feature "Bk is the proportion of blacks
in town". It is an incredibly racist "feature" to include in any
dataset. I think it is beneath us as data scientists.
I submit that the Ames dataset is a viable alternative for
learning regression. The author has shown that the dataset is a
more robust replacement for Boston. Ames is a 2011 regression
dataset on housing prices and has more than 5 times the number of
training examples and over 7 times as many features (none of
which are morally questionable).
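If an Ames loader were wanted, a rough sketch of how the data could be pulled in, assuming the OpenML copy named "house_prices" is the Ames data and that a scikit-learn version providing fetch_openml is installed (both are assumptions, not something the project ships as part of this proposal):

    from sklearn.datasets import fetch_openml

    # "house_prices" is assumed to be the OpenML name for the Ames data.
    X, y = fetch_openml(name="house_prices", as_frame=True, return_X_y=True)
    print(X.shape)   # several times larger than Boston in both rows and columns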
I welcome the community's thoughts on the matter.
Thanks.
-Tony
Here's an article I wrote on the Boston dataset:
https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn