Re: [liberationtech] Linguistics identifies anonymous users

Gregory Foster Wed, 09 Jan 2013 07:20:43 -0800

29c3 - "Stylometry and Online Underground Markets" w/ Aylin Caliskan Islam, Rachel Greenstadt, and Sadia Afroz:

http://www.youtube.com/watch?v=QRY2mfLpPCs
http://events.ccc.de/congress/2012/Fahrplan/events/5230.en.html


gf


On 1/9/13 7:34 AM, Shava Nerad wrote:

Such a framework can be socially engineered as easily as SEO. I make a small living as a ghost writer and speech writer - the informal version of that very process. Several of my clients say my writing sounds more like them in print than they do, because they are less facile writers - but that is a fault that could be avoided in competent forgeries. ;)
SN


On Jan 9, 2013 8:25 AM, "Eugen Leitl" <[email protected]> wrote:
http://www.scmagazine.com.au/News/328135,linguistics-identifies-anonymous-users.aspx

Linguistics identifies anonymous users

By Darren Pauli on Jan 9, 2013 9:49 AM

Researchers reveal carders, hackers on underground forums.

Up to 80 percent of certain anonymous underground forum users can be
identified using linguistics, researchers say.
The techniques compare user posts to track them across forums and could even
unveil authors of thesis papers or blogs who had taken to underground
networks.

"If our dataset contains 100 users we can at least identify 80 of them,"
researcher Sadia Afroz told an audience at the 29C3 Chaos Communication
Congress in Germany.
"Function words are very specific to the writer. Even if you are writing a
thesis, you'll probably use the same function words in chat messages.

"Even if your text is not clean, your writing style can give you away."
The analysis techniques could also reveal botnet owners and malware tool
authors, and provide insight into the size and scope of underground markets,
making the research appealing to law enforcement.
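
To make the function-word remark quoted above concrete, here is a minimal sketch of that kind of comparison. It is not the researchers' method or JStylo's API: the word list, the helper names (profile, cosine, closest_author) and the nearest-profile rule are all assumptions chosen for illustration, and real stylometric feature sets are far larger.

```python
import math
import re
from collections import Counter

# Tiny illustrative function-word list (an assumption, not the real feature set).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that",
                  "it", "is", "but", "so", "for", "with", "on"]

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_author(unknown_text, known_texts):
    """Name of the known author whose function-word profile is most similar."""
    target = profile(unknown_text)
    return max(known_texts,
               key=lambda name: cosine(target, profile(known_texts[name])))
```

With only a few hundred words per author a profile like this is very noisy, which fits the article's point that large samples are needed.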

To achieve their results the researchers used techniques including
stylometric analysis, the authorship attribution framework JStylo, and Latent
Dirichlet allocation, which can distinguish a conversation on stolen credit
cards from one on exploit-writing, and similarly help identify interesting
people.
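
As a rough illustration of how Latent Dirichlet allocation separates topics, the sketch below implements a small collapsed Gibbs sampler using only the standard library. The function name lda_gibbs, the hyperparameters alpha and beta, and the iteration count are assumptions for the example; the article does not say which implementation or settings the researchers used.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenised documents.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Start from a random topic assignment for every token.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word
```

Each row of doc_topic can be read as how strongly a post leans towards each topic, and topic_word[k].most_common(10) lists the words that characterise topic k.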
The analysis was applied across millions of posts from tens of thousands of
users of a series of multilingual underground websites including
thebadhackerz.com, blackhatpalace.com, www.carders.cc, free-hack.com,
hackel1te.info, hack-sector.forumh.net, rootwarez.org, L33tcrew.org and
antichat.ru.
It found up to 300 distinct discussion topics in the forums, with some of the
most popular being carding, encryption services, password cracking and
blackhat search engine optimisation tools.
While successful, the work faces a series of challenges. Analysis could only
be performed using a minimum of 5000 words (this research used the "gold
standard" of 6500 words) which culled the list of potential targets from tens
of thousands to mere hundreds.
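
The culling step described above amounts to a simple filter on each user's total word count. A minimal sketch, assuming posts arrive as (author, text) pairs and that words are whitespace-separated; the function name eligible_authors is hypothetical:

```python
def eligible_authors(posts, min_words=5000):
    """Map each author with at least min_words of text to their word total.

    posts is an iterable of (author, text) pairs.
    """
    totals = {}
    for author, text in posts:
        totals[author] = totals.get(author, 0) + len(text.split())
    return {author: n for author, n in totals.items() if n >= min_words}
```

Raising min_words to 6500 reproduces the stricter "gold standard" threshold the article mentions.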

It also needs to separate discussion on product information like credit
cards, exploits and drugs from conversational text in order to facilitate
machine learning to automate the process, according to researcher Aylin
Caliskan Islam.

And posts must be translated to English, a process which boosted author
identification from 66 to around 80 per cent but was imperfect using freely
available tools like Google and Bing.

However both of these tasks were performed successfully, and further
development including the use of "exclusive" language translation tools would
only serve to boost the identification accuracy.
Leetspeak, an alternative alphabet popular in some forum circles, cannot be
translated.
The project is ongoing and future work promises to increase the capacity to
unmask users. This, Islam said, would include temporal information which would
exploit users who logged into forums from the same IP addresses and wrote
posts at around the same time.
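
One way the temporal signal could be used is to compare the hours of the day at which two accounts post. The sketch below is an assumption about how that might look, not the researchers' planned implementation; hour_histogram and overlap are hypothetical helper names, and IP address matching is left out.

```python
from collections import Counter
from datetime import datetime

def hour_histogram(timestamps):
    """Fraction of posts made in each hour of the day (24 buckets)."""
    counts = Counter(ts.hour for ts in timestamps)
    total = len(timestamps) or 1  # avoid division by zero
    return [counts[h] / total for h in range(24)]

def overlap(p, q):
    """Histogram intersection: 1.0 for identical habits, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(p, q))

# Two accounts that both post in the evening score high against each other.
account_a = [datetime(2013, 1, 1, 20, 5), datetime(2013, 1, 2, 21, 10)]
account_b = [datetime(2013, 1, 3, 20, 30), datetime(2013, 1, 4, 21, 45)]
score = overlap(hour_histogram(account_a), hour_histogram(account_b))
```

A high score on its own proves little, since many people post in the evening; it would only be useful combined with stylometric or other evidence.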

[Image: Antichat user analysis]

"They might finish work, come home and log in," Islam said.
It could also tie user identities to the topics they write about and produce
a map of their interactions, identify multiple accounts held by a single
author, and combine forum messages with internet relay chat (IRC) data sets.
"We want to automate the whole process."

Afroz said while the work appeals to law enforcement and government
agencies, it is not designed to catch users out.
"We aren't trying to identify users, we are trying to show them that this is
possible," she said.
To this end, the researchers released tools last year, updated last December,
which help users to anonymise their writing.
One tool, Anonymouth, takes a 500 word sample of a user's writing to identify
unique features such as function words which could make them identifiable.
The other, JStylo, is the machine learning engine which powers Anonymouth.
The Drexel and George Mason universities research team is composed of Sadia
Afroz, Aylin Caliskan Islam, Ariel Stolerman, Rachel Greenstadt, and Damon
McCoy.


--
Gregory Foster || [email protected]
@gregoryfoster <> http://entersection.com/
