[
https://issues.apache.org/jira/browse/ASTERIXDB-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323604#comment-15323604
]
Jianfeng Jia commented on ASTERIXDB-1264:
-----------------------------------------
This is the DDL:
{code}
drop dataverse twitter if exists;
create dataverse twitter if not exists;
use dataverse twitter;

create type typeUser if not exists as open {
    id: int64,
    name: string,
    screen_name: string,
    lang: string,
    location: string,
    create_at: date,
    description: string,
    followers_count: int32,
    friends_count: int32,
    statues_count: int64
}

create type typePlace if not exists as open {
    country: string,
    country_code: string,
    full_name: string,
    id: string,
    name: string,
    place_type: string,
    bounding_box: rectangle
}

create type typeGeoTag if not exists as open {
    stateID: int32,
    stateName: string,
    countyID: int32,
    countyName: string,
    cityID: int32?,
    cityName: string?
}

create type typeTweet if not exists as open {
    create_at: datetime,
    id: int64,
    "text": string,
    in_reply_to_status: int64,
    in_reply_to_user: int64,
    favorite_count: int64,
    coordinate: point?,
    retweet_count: int64,
    lang: string,
    is_retweet: boolean,
    hashtags: {{ string }}?,
    user_mentions: {{ int64 }}?,
    user: typeUser,
    place: typePlace?,
    geo_tag: typeGeoTag
}

create dataset ds_tweet(typeTweet) if not exists primary key id with filter on create_at;
//"using" "compaction" "policy" CompactionPolicy ( Configuration )? )?

create index text_idx if not exists on ds_tweet("text") type keyword;
create index location_idx if not exists on ds_tweet(coordinate) type rtree;
// create index time_idx if not exists on ds_tweet(create_at) type btree;
create index state_idx if not exists on ds_tweet(geo_tag.stateID) type btree;
create index county_idx if not exists on ds_tweet(geo_tag.countyID) type btree;
create index city_idx if not exists on ds_tweet(geo_tag.cityID) type btree;

create feed MessageFeed using localfs(
    ("path"="128.195.52.77:///home/jianfeng/data/head20m.adm"),
    ("format"="adm"),
    ("type-name"="typeTweet"));

set wait-for-completion-feed "true";
connect feed MessageFeed to dataset ds_tweet;
{code}
This file feed worked OK. After the system finished reading the ADM file, I used
another socket adapter to keep ingesting real-time data, and then the freeze
scenario happened.
The original data is too big, so I uploaded a small sample
[here|https://drive.google.com/open?id=0B423M7wGZj9daWpCczRvalNZRkk]. Hopefully
it can reproduce the problem; let me know if it can't, and I will upload the
bigger one.
> Feed didn't release lock if the ingesting hit some exceptions
> -------------------------------------------------------------
>
> Key: ASTERIXDB-1264
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-1264
> Project: Apache AsterixDB
> Issue Type: Bug
> Components: Feeds
> Reporter: Jianfeng Jia
> Assignee: Abdullah Alamoudi
>
> This issue was discussed on the mailing list; I am copying it here to make it
> more trackable and shareable.
> I hit a weird issue that is reproducible, but only if the data has
> duplications and is also large enough. Let me explain it step by step:
> 1. The dataset is very simple and has only two fields.
> DDL AQL:
> {code}
> drop dataverse test if exists;
> create dataverse test;
> use dataverse test;
> create type t_test as closed {
>     fa: int64,
>     fb: int64
> }
> create dataset ds_test(t_test) primary key fa;
> create feed fd_test using socket_adapter
> (
> ("sockets"="nc1:10001"),
> ("address-type"="nc"),
> ("type-name"="t_test"),
> ("format"="adm"),
> ("duration"="1200")
> );
> set wait-for-completion-feed "false";
> connect feed fd_test to dataset ds_test using policy AdvancedFT_Discard;
> {code}
> ----
> The AdvancedFT_Discard policy ignores exceptions from insertion and keeps
> ingesting.
> 2. Ingest the data with a very simple socket adapter that reads records one
> by one from an ADM file. The source is here:
> https://github.com/JavierJia/twitter-tracker/blob/master/src/main/java/edu/uci/ics/twitter/asterix/feed/FileFeedSocketAdapterClient.java
> The data and the app package is provided here:
> https://drive.google.com/folderview?id=0B423M7wGZj9dYVQ1TkpBNzcwSlE&usp=sharing
> To feed the data you can run:
> ./bin/feedFile -u 172.17.0.2 -p 10001 -c 5000000 ~/data/twitter/test.adm
> -u for the server URL
> -p for the server port
> -c for the count of lines you want to ingest
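> The client linked above can be sketched roughly as follows. This is an
> illustrative Java sketch, not the real client's code; the class and method
> names are hypothetical, and it assumes one ADM record per line:
> {code}
> import java.io.*;
>
> public class FileFeedClient {
>
>     // Writes up to maxRecords lines from the reader to the output stream,
>     // one ADM record per line, and returns how many records were sent.
>     static int sendRecords(BufferedReader in, OutputStream out, int maxRecords)
>             throws IOException {
>         Writer w = new BufferedWriter(new OutputStreamWriter(out));
>         int sent = 0;
>         String line;
>         while (sent < maxRecords && (line = in.readLine()) != null) {
>             w.write(line);
>             w.write('\n');
>             sent++;
>         }
>         w.flush(); // make sure buffered records actually reach the socket
>         return sent;
>     }
>
>     public static void main(String[] args) throws IOException {
>         // args mirror the flags above: <url> <port> <count> <admFile>
>         try (java.net.Socket s = new java.net.Socket(args[0], Integer.parseInt(args[1]));
>              BufferedReader r = new BufferedReader(new FileReader(args[3]))) {
>             int n = sendRecords(r, s.getOutputStream(), Integer.parseInt(args[2]));
>             System.out.println("sent " + n + " records");
>         }
>     }
> }
> {code}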
> 3. After ingestion, all requests against ds_test hang. There are no
> exceptions and no responses for hours. However, the system can still answer
> queries on other datasets, such as Metadata.
> The data contains some duplicated records, which should trigger the insert
> exception. If I lower the count from 5000000 to, say, 3000000, there are no
> problems, although that range contains duplications as well.
> Answer from [~amoudi]:
> I know exactly what is going on here. The problem you pointed out is caused
> by the duplicate keys. If I remember correctly, the main issue is that the
> locks placed on the primary keys are not released.
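> The failure mode can be sketched generically. This is an illustrative Java
> sketch, not AsterixDB's actual lock manager: if the primary-key lock is not
> released in a finally block, a duplicate-key exception (even one a policy
> like AdvancedFT_Discard later swallows) strands the lock, and every later
> request on that key waits forever.
> {code}
> import java.util.HashMap;
> import java.util.Map;
> import java.util.concurrent.locks.ReentrantLock;
>
> public class LockLeakSketch {
>     final Map<Long, ReentrantLock> pkLocks = new HashMap<>();
>     final Map<Long, Long> store = new HashMap<>();
>
>     ReentrantLock lockFor(long pk) {
>         return pkLocks.computeIfAbsent(pk, k -> new ReentrantLock());
>     }
>
>     // Buggy pattern: the unlock is skipped when the insert throws,
>     // so the primary-key lock is held forever.
>     void insertLeaky(long pk, long value) {
>         ReentrantLock l = lockFor(pk);
>         l.lock();
>         if (store.containsKey(pk)) {
>             throw new IllegalStateException("duplicate key: " + pk); // lock never released
>         }
>         store.put(pk, value);
>         l.unlock();
>     }
>
>     // Correct pattern: release in finally, so even a discarded
>     // exception cannot strand the lock.
>     void insertSafe(long pk, long value) {
>         ReentrantLock l = lockFor(pk);
>         l.lock();
>         try {
>             if (store.containsKey(pk)) {
>                 throw new IllegalStateException("duplicate key: " + pk);
>             }
>             store.put(pk, value);
>         } finally {
>             l.unlock();
>         }
>     }
> }
> {code}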
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)